

Computing  Science 
and Statistics  AD-A252 933


Proceedings  of  the 
23rd  Symposium  on  the  Interface 


Critical  Applications  of  Scientific  Computing: 
Biology,  Engineering,  Medicine,  Speech... 

April  21-24, 1991 

Elaine  M.  Keramidas 

Editor 


This document has been approved
for public release and sale; its
distribution is unlimited.


INTERFACE 
FOUNDATION 
OF  NORTH  AMERICA 


MASTER COPY


THIS COPY FOR REPRODUCTION PURPOSES


Form  Approved 
OMB No. 0704-0188




1. AGENCY USE ONLY (Leave blank)   2. REPORT DATE   3. REPORT TYPE AND DATES COVERED

Final 15 Mar 91 - 14 Mar 92


4. TITLE AND SUBTITLE


5. FUNDING NUMBERS


Computing  Science  and  Statistics,  Interface '91 


6. AUTHOR(S)


DAAL03-91-G-0085 


J.R.  Kettenring  (principal  investigator) 


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

Interface  Foundation  of  North  America,  Inc. 
Fairfax  Station,  VA  22039-7460 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

U. S. Army Research Office
P. O. Box 12211

Research  Triangle  Park,  NC  27709-2211 


10.  SPONSORING /MONITORING 
AGENCY  REPORT  NUMBER 

ARO 28534.1-MA-CF


11.  SUPPLEMENTARY  NOTES 

The views, opinions and/or findings contained in this report are those of the
author(s)  and  should  not  be  construed  as  an  official  Department  of  the  Army 
position,  policy,  or  decision,  unless  so  designated  by  other  documentation. 


12a. DISTRIBUTION/AVAILABILITY STATEMENT   12b. DISTRIBUTION CODE

Approved  for  public  release;  distribution  unlimited. 


13. ABSTRACT (Maximum 200 words)

The  extremely  successful  workshop  on  Computational  Molecular  Biology  featured  world- 
renowned  speakers.  The  workshop  served  as  a  focused  example  of  the  very  real 
interface  between  biology,  statistics,  and  computing  science.  Much  of  the  success 
of  a  conference  can  be  measured  in  terms  of  the  number  of  attendees  and  the  number 
of  contributed  talks,  which,  for  this  Symposium,  were  approximately  400  and  116, 
respectively. These Proceedings include 78% of the contributed papers and 65%
of  the  invited  papers  that  were  given  in  Seattle  -  a  more  than  adequate 
representation  of  the  work  presented  at  the  Symposium. 


14.  SUBJECT  TERMS 

Symposium, Scientific Computing, Computing Science,
Statistics,  Biology 


15. NUMBER OF PAGES

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT

NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18


COMPUTING  SCIENCE 
AND  STATISTICS 


Proceedings  of  the 
23rd  Symposium  on  the  Interface 

Seattle,  Washington,  April  21-24, 1991 


Editor 

ELAINE  M.  KERAMIDAS 

Bellcore,  Morristown,  New  Jersey 

Assistant  Editor 
SELMA  M.  KAUFMAN 
Bellcore,  Morristown,  New  Jersey 


92-09913 


INTERFACE  FOUNDATION  OF  NORTH  AMERICA 

92  4  17  042 


The  papers  and  discussions  in  this  Proceedings  volume  are  reproduced  exactly  as  received  from 
the  authors.  These  presentations  are  presumed  to  be  essentially  as  given  at  the  23rd  Symposium 
on  the  Interface.  This  Proceedings  volume  is  not  copyrighted  by  the  Interface  Foundation  of 
North  America.  Publication  in  the  Proceedings  does  not  preclude  publication  elsewhere. 


AVAILABILITY  OF  PROCEEDINGS 


22nd            Springer-Verlag New York, Inc.
(1990)          175 Fifth Ave.
                New York, New York 10010

20, 21st        American Statistical Association
(1988, 1989)    1429 Duke Street
                Alexandria, VA 22314-3402

                also Interface Foundation
                P.O. Box 7460
                Fairfax Station, VA 22039-7460

18, 19th        ASA
(1986, 1987)    1429 Duke Street
                Alexandria, VA 22314-3402


Interface  Foundation  of  North  America,  Inc. 
P.O.  Box  7460 

Fairfax  Station,  VA  22039-7460 


PRINTED IN THE U.S.A.


PREFACE 


1991  Interface  Proceedings 


The  23rd  Symposium  on  the  Interface  between  Computing  Science  and  Statistics  was  held  on  April  21- 
24,  1991,  at  the  Seattle  Sheraton  Hotel,  Seattle,  Washington.  The  conference  theme  was  "Critical 
Applications  of  Scientific  Computing:  Biology,  Engineering,  Medicine,  Speech...”.  The  Symposium  was 
preceded  by  a  workshop  on  Computational  Molecular  Biology. 

Bellcore  hosted  the  Symposium  with  Jon  R.  Kettenring  serving  as  Program  Chair.  He  assembled  an 
outstanding  program  with  a  committee  that  selected  topics  and  invited  speakers  who  collectively  made 
the Symposium a forum for the exchange of exciting new ideas and provided a spectrum of applications
for  scientific  computing.  The  members  of  the  program  committee  were  Mary  Ellen  Bock,  Andreas  Buja, 
William  DuMouchel,  Nicholas  Fisher,  Gene  Golub,  Joe  Hill,  John  McDonald,  John  Nash,  Daryl 
Pregibon,  Werner  Stuetzle,  Michael  Tarter,  Luke  Tierney,  Paul  Tukey,  Paul  Young,  and  myself.  John 
Nash  devoted  much  time  and  effort  to  organizing  a  special  multi-media  session  comprised  of  posters, 
videos,  and  demonstrations.  Tutorials  were  presented  by  Joe  Hill,  William  Eddy  and  Mark  Schervish. 


The  extremely  successful  workshop  on  Computational  Molecular  Biology  was  organized  by  Simon 
Tavare  and  featured  world-renowned  speakers.  The  workshop  served  as  a  focused  example  of  the  very 
real  interface  between  biology,  statistics,  and  computing  science.  This  theme  was  evident  in  the  keynote 
address,  "Opportunities  for  Statisticians  and  Computer  Scientists  in  Biology",  that  was  presented  by  Eric 
Lander.  Burton  Smith  transported  those  that  attended  the  banquet  into  the  computing  world  of  tomorrow 
by  speaking  on  "Future  Supercomputing".  The  talks  presented  in  the  workshop  as  well  as  the  keynote  and 
banquet  addresses  are  not  included  in  these  Proceedings. 

Much  of  the  success  of  a  conference  can  be  measured  in  terms  of  the  number  of  attendees  and  the  number 
of  contributed  talks,  which,  for  this  Symposium,  were  approximately  400  and  116,  respectively. 
However,  a  significant  indicator  of  the  lasting  enthusiasm  that  remains  with  the  speakers  after  a 
conference  has  ended  is  their  commitment  to  undertake  the  task  of  completing  the  manuscripts  that  will 
comprise  the  proceedings  of  that  conference.  These  Proceedings  include  78%  of  the  contributed  papers 
and  65%  of  the  invited  papers  that  were  given  in  Seattle  -  a  more  than  adequate  representation  of  the 
work  presented  at  the  Symposium. 


Organizing such a conference is a Herculean feat that necessarily requires the cooperation and
dedication of many people. I would like to thank all of those people at Bellcore and the University of
Washington who assisted in a myriad of ways. I would also like to thank Selma Kaufman for serving as
Assistant Editor of these Proceedings.




Elaine  M.  Keramidas,  Editor 
Workshop in Computational Molecular Biology


9:00-9:05     S. Tavare, University of Southern California.
              Introduction

9:05-9:20     D. Galas, Department of Energy.
              Overview

9:20-10:00    F. Cohen, UC San Francisco.
              Computational aspects of the protein folding problem

10:00-10:40   T. Schlick, NYU.
              New computational techniques for computing biomolecular
              structures and their dynamics

10:40-11:10   Coffee Break

11:10-11:50   E. Branscomb, Lawrence Livermore National Labs.
              Building physical genome maps by random clone overlap;
              a progress assessment of work on human chromosome 19

11:50-12:30   E.A. Thompson, University of Washington.
              Monte Carlo methods for linkage analysis and complex
              models

12:30-1:50    Lunch

1:50-2:30     E.S. Lander, Whitehead Institute.
              Dissecting complex inheritance: statistical and computational
              issues

2:30-3:10     E. Myers, University of Arizona.
              Practical and theoretical advances in sequence comparison

3:10-3:40     Coffee Break

3:40-4:20     R.J. Roberts, Cold Spring Harbor Labs.
              Error detection in DNA sequences

4:20-5:00     M.S. Waterman, University of Southern California.
              Computer methods for locating kinetoplastid cryptogenes


SYMPOSIUM  SCHEDULE 


Sunday,  April  21, 1991 

8:00  a.m.  -  9:00  a.m.  Registration  (Pre  -  Function  Area) 

9:00 a.m. - 5:00 p.m.  Workshop on Computational Molecular Biology

5:00  p.m.  -  8:00  p.m.  Registration  (Area  in  front  of  Metropolitan  Ballroom) 

5:00  p.m.  -  8:00  p.m.  Board  of  Directors’  Business  Meeting  and  Dinner  (Cedar) 

8:00  p.m.  - 10:00  p.m.  Opening  Reception  (Metropolitan  Ballroom) 

Monday,  April  22, 1991 

8:30  a.m.  -  9:45  a.m.  Keynote  Address:  "Opportunities  for  Statisticians  and  Computer  Scientists  in  Biology" 
(Grand  Ballroom  C) 

9:45  a.m.  - 10:15  a.m.  Break  (Grand  Ballroom  A) 

10:15  a.m.  - 12:00  p.m.  Invited  A  :  Speech  and  Language  (Grand  Ballroom  B) 

Invited  B  :  Scientific  Computing  Problems  in  the  Aircraft  Industry  (Metropolitan  Ballroom) 
Invited  C  :  Uncertainty  and  Graphical  Models  (West  Ballroom) 

Contributed  A  :  Statistical  Graphics  (Douglas) 

Contributed  B  :  Multivariate  Analysis  (Juniper) 

Contributed  C  :  Random  Number  Generators  -  Simulation  (Madrona) 

12:00 p.m. - 2:00 p.m.  Lunch

2:00  p.m.  -  3:45  p.m.  Invited  A  :  Relational  Databases:  A  Tutorial  for  Statisticians  (Grand  Ballroom  B) 

Invited  B  :  Computing  Problems  in  Environmental  and  Industrial  Statistics  (West  Ballroom) 

Contributed  A  :  Software  Testing  (Douglas) 

Contributed  B  :  Computing  and  Graphics  in  Applications  (Juniper) 

Contributed C : Robustness (Madrona)

3:45  p.m.-  4:15  p.m.  Break  (Grand  Ballroom  A) 

4:15 p.m. - 6:00 p.m.  Invited A : Massive Databases (Metropolitan Ballroom)

Invited  B  :  Engineering  Applications  of  Computing-Intensive  Methods  (West  Ballroom) 
Invited  C  :  Computational  Methods  in  Spatial  Statistics  (Grand  Ballroom  B) 

Contributed  A  :  Artificial  Intelligence  -  Belief  Functions  (Douglas) 

Contributed  B  :  Issues  in  Interactive  Graphics  (Juniper) 

Contributed  C :  Time  Series  Prediction  -  Function  Estimation  (Madrona) 

Tuesday,  April  23, 1991 

8:00  a.m.  -  9:45  a.m.  Invited  A :  Computationally  Intensive  Methods  for  Discrete  Data  (Grand  Ballroom  C) 
Invited  B  :  Data  Visualization  and  Sonification  (Grand  Ballroom  B) 

Contributed  A  :  Classification  -  Density  Estimation  (Douglas) 

Contributed  B  :  Statistical  Inference  (Juniper) 

Contributed  C  :  Genetics  -  DNA  (Madrona) 
9:45 a.m. - 10:15 a.m.  Break (Grand Ballroom A)

10:15 a.m. - 12:00 p.m.  Invited A : Realistic Rendering: A Tutorial for Statisticians (Grand Ballroom B)

Invited B : Computer Modeling, Experimental Design and Data Analysis (Grand Ballroom C)

Contributed A : Neural Nets - Biological Systems (Douglas)

Contributed B : Bootstrap and Related Methods (Juniper)

Contributed C : Optimization - Genetic Algorithms (Madrona)

12:00 p.m. - 2:00 p.m.  Poster/Video/Demo Session

2:00 p.m. - 3:45 p.m.  Invited A : Virtual Interface Technology (Grand Ballroom C)

Invited B : Neural Networks (Grand Ballroom B)

Invited C : Computational Statistical Genetics (East Ballroom)

Contributed A : Tree-Based Methods (Douglas)

Contributed B : Information Retrieval - Record Linkage (Juniper)

Contributed C : Allocation Problems - Sequential Design (Madrona)

3:45 p.m. - 4:15 p.m.  Break (Grand Ballroom A)

4:15 p.m. - 6:00 p.m.  Invited A : Dynamic Statistical Graphics (Grand Ballroom B)

Invited B : Research Opportunities at the Interface of Biology, Statistics and Computing
(Grand Ballroom C)

Contributed A : Integration - Probability Computations (Douglas)

Contributed B : Databases and Information Processing (Juniper)

Contributed C : Problems Relating to Skewness and Kurtosis (Madrona)

6:30 p.m. - 7:30 p.m.  Reception (Pre - Function Area)

7:30 p.m. - 10:00 p.m.  Banquet (Grand Ballroom C)

Banquet Address: "Future Supercomputing"

Wednesday, April 24, 1991

8:00 a.m. - 9:45 a.m.  Invited A : Computational Problems in Biomedical Imaging (Grand Ballroom B)

Invited B : Parallel Computing: A Tutorial for Statisticians (Grand Ballroom C)

Contributed A : Spatial Data - Shape Analysis (Douglas)

Contributed B : Programming Environments (Juniper)

Contributed C : Estimation Problems I (Madrona)

9:45 a.m. - 10:15 a.m.  Break (Grand Ballroom A)

10:15 a.m. - 12:00 p.m.  Invited A : Multivariate Statistics and Visualization for Labelled Point Data (Grand Ballroom C)

Invited B : Statistical Computing Environments for the 21st Century (Cirrus)

Invited C : Bayesian Computing (Grand Ballroom B)

Contributed A : Image Analysis (Douglas)

Contributed B : Applications Areas (Juniper)

Contributed C : Estimation Problems II (Madrona)

12:00 p.m.  End of Conference

Abstract


We describe here the application of classification and regression
trees to some problems in speech and language. We begin with a
brief overview of the technique. We then describe its
application to:

(1) End of sentence detection: The not-so-simple problem of
deciding when a period in text corresponds to the end of a
declarative sentence (and not an abbreviation) is addressed
with trees using the Brown corpus as input. The result is
99.8% correct classification.

(2) Segment duration modelling in speech synthesis: 400
utterances from a single speaker and 4000 utterances from
400 speakers were used to build decision trees that predict
segment durations based on features such as lexical position,
stress, and phonetic context. Over 70% of the durational
variance for the single speaker and over 60% for the multiple
speakers was accounted for by these methods.

(3) Phoneme to phone prediction: A lattice of possible
close phonetic transcriptions given a phonemic transcription
(from the orthography and a dictionary) is produced using
the 4000-utterance TIMIT database as input. The most likely
phone corresponding to a phoneme can be predicted 83% correctly.
The five most likely phones can be predicted 99% correctly.


1.  Introduction 


Several applications of statistical tree-based modelling
to problems in speech and language are described here.
Classification and regression trees are well suited to many of the
pattern recognition problems encountered in this area since
they (1) statistically select the most significant features
involved, (2) provide “honest” estimates of their performance,
(3) permit both categorical and continuous features to be
considered, and (4) allow human interpretation and exploration
of their result. First the method is summarized, then
its application to end-of-sentence detection in text, phonetic
segment duration prediction, and phoneme-to-phone classification
is described. We conclude with some general remarks on the
strengths and shortcomings of this method.
For other applications to speech and language, see [Lucassen
1984], [Bahl, et al. 1987].


2.  Classification  and  Regression  Trees 

An excellent description of the theory and implementation
of tree-based statistical models can be found in Classification
and Regression Trees [Breiman, et al. 1984]. A brief
introduction to these ideas will be provided in this section
for those who may not be familiar with them.

Consider the not-so-simple problem of deciding when a
period in text corresponds to the end of a declarative
sentence. This is not as trivial a classification problem as it
may first seem. While a period, by convention, must occur
at the end of a declarative sentence, one can also occur in
abbreviations. Abbreviations can also occur at the end of
a sentence. The tagged Brown corpus [Kucera and Francis
1967] of a million words indicates that about 90% of periods
occur at the end of sentences, 10% at the end of abbreviations,
and about 1/2% in both. The two space rule after an
end stop is often ignored and is never present in many text
sources (e.g., the AP news).

Figure  1  shows  a  classification  tree  for  this  problem 
trained  on  the  Brown  corpus.  Let  us  first  see  how  to  use 
such  a  tree  for  classification.  Then  we  will  see  how  the  tree 
was  generated. 

The  decision  of  when  a  period  occurs  at  the  end  of  a 
sentence  will  depend  on  factors  such  as  whether  the  word 
following  the  period  is  capitalized  or  if  the  word  containing 
the period is a common abbreviation. Suppose we see the
text  fragment  “Smith.  The”.  Does  the  period  after  “Smith” 
occur  at  the  end  of  a  sentence? 

Starting at the root node in Figure 1, the first decision
is whether the word after the period, “the” (case ignored
here), has a probability greater than 27% of occurring at the
beginning of a sentence, relative to its frequency in text. The
answer is no (estimated from a database described below), so
the left branch is taken. The next split is whether the word
containing the period, “smith”, has a probability greater than
1% of occurring at the end of a sentence, relative to its
frequency in text. The answer is yes, so the right branch is
taken. The next split concerns the case of the word after the
period. Since it is a capitalized word, the left branch is taken.
Finally, the last question is whether the word containing the
period, “Smith”, is one of several common abbreviation types.
Since it is not, the left branch is taken to a terminal node that
classifies this case as indeed at the end of a declarative sentence.

Figure 1. Classification tree for end-of-sentence detection.
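
The walk from root to terminal node can be sketched in a few lines of Python. This is only an illustration: the `Node` and `Leaf` types, the feature names, and the two probability values for “Smith. The” are assumptions made here, and every branch that the text does not describe is filled with a placeholder leaf, since Figure 1 itself is not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class predicted at this terminal node


@dataclass
class Node:
    question: Callable[[dict], bool]  # True -> right branch, False -> left
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(tree, features):
    """Follow yes/no answers from the root until a terminal node is reached."""
    while isinstance(tree, Node):
        tree = tree.right if tree.question(features) else tree.left
    return tree.label


UNKNOWN = Leaf("unspecified")  # branches the text does not describe

tree = Node(
    lambda f: f["p_next_begins_sentence"] > 0.27,
    left=Node(
        lambda f: f["p_word_ends_sentence"] > 0.01,
        left=UNKNOWN,
        right=Node(
            lambda f: f["next_case"] != "Cap",
            left=Node(
                lambda f: f["abbrev_class"] is not None,
                left=Leaf("end-of-sentence"),
                right=UNKNOWN,
            ),
            right=UNKNOWN,
        ),
    ),
    right=UNKNOWN,
)

smith_the = {
    "p_next_begins_sentence": 0.10,  # made-up value below the 27% threshold
    "p_word_ends_sentence": 0.05,    # made-up value above the 1% threshold
    "next_case": "Cap",
    "abbrev_class": None,            # "Smith" is not an abbreviation type
}
print(classify(tree, smith_the))
```

Only the thresholds (27% and 1%) and the order of the four questions come from the text; the rest is scaffolding.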

In  the  training  set,  5137  of  the  5283  examples  that 
reached  this  node  were  correctly  classified.  This  tree  is  a 
subtree of a better classifier to be described in the next section;
this example was pruned for illustrative purposes.

This is an example of a classification tree, since the decision
is to choose one of several classes; in this case, there are
two classes: {end-of-sentence, not-end-of-sentence}.
In other words, the predicted variable, y, is categorical. Trees
can be created for continuous y also. In this case they are
called regression trees with the terminal nodes labelled with
a real number (or, more generally, a vector).

Classifying  with  an  existing  tree  is  easy;  a  more  difficult 
issue  is  how  to  generate  the  tree  for  a  given  problem.  There 
are  three  basic  questions  that  have  to  be  answered  when 
generating a tree: (1) what are the splitting rules, (2) what
are  the  stopping  rules,  and  (3)  what  prediction  is  made  at 
each  terminal  node? 

Let us begin answering these questions by introducing
some notation. Consider that we have N samples of data,
with each sample consisting of M features, $x_1, x_2, x_3, \ldots, x_M$.
In the end-of-sentence detection example, $x_1$ might be the
case of the word following the period, $x_2$ the probability that
the following word begins a sentence, etc. Just as the y
(dependent) variable can be continuous or categorical, so can
the x (independent) variables. E.g., word case is categorical
(cannot be usefully ordered), while beginning word
probability is continuous.


The first question (what are the splitting rules?) refers to
what split to take at a given node. It has two parts: (a)
what candidates should be considered, and (b) which is the
best choice among candidates for a given node?

A simple choice is to consider splits based on one x variable
at a time. If the independent variable being considered
is continuous, $-\infty < x < \infty$, consider splits of the form:

$x \le k$ vs. $x > k$, $\forall k$.

In other words, consider all binary cuts of that variable. If
the independent variable is categorical, $x \in \{1, 2, \ldots, n\} = X$,
consider splits of the form:

$x \in A$ vs. $x \in X - A$, $\forall A \subset X$.

In other words, consider all binary partitions of that variable.
More sophisticated splitting rules would allow combinations
of such splits at a given node; e.g., linear combinations of
continuous variables, or boolean combinations of categorical
variables.
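
As a rough sketch of how the single-variable candidate splits might be enumerated, the Python below lists thresholds for a continuous variable and binary partitions for a categorical one. The function names are hypothetical, and placing each threshold at the midpoint between adjacent observed values is an assumption made here (any k in the same gap yields the same partition of the training samples).

```python
from itertools import combinations


def continuous_splits(values):
    """Thresholds k for 'x <= k' vs. 'x > k', one per gap between distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def categorical_splits(categories):
    """All binary partitions (A, X - A) of a set of categories, each listed once."""
    cats = sorted(set(categories))
    if len(cats) < 2:
        return []
    first, rest = cats[0], cats[1:]
    splits = []
    # Keep `first` in A so that (A, X - A) and (X - A, A) are not both listed,
    # and stop before A would swallow every category.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            a = {first, *extra}
            splits.append((a, set(cats) - a))
    return splits


print(continuous_splits([1, 3, 3, 7]))
print(len(categorical_splits(["Upper", "Lower", "Cap", "Numbers"])))
```

For a categorical variable with n values this produces 2^(n-1) - 1 partitions, which is why large category sets become expensive.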

A  simple  choice  to  decide  which  of  these  splits  is  the 
best  at  a  given  node  is  to  select  the  one  that  minimizes  the 
estimated classification or prediction error after that split
based  on  the  training  set.  Since  this  is  done  stepwise  at  each 
node,  this  is  not  guaranteed  to  be  globally  optimal  even  for 
the  training  set. 

In fact, there are cases where this is a bad choice. Consider
Figure 2, where two different splits are illustrated for
a classification problem having two classes (No. 1 and No.
2) and 800 samples in the training set (with 400 in each
class). If we label each child node according to the greater
class present there, we see that the two different splits
illustrated both give 200 samples misclassified. Thus, minimizing
the error gives no preference to either of these splits [after
Breiman, et al. 1984].

Figure 2. Two different splits with the same misclassification
rate [after Breiman, et al. 1984].

The split on the right, however, is better because it creates
at least one very pure node (no misclassification) which
needs no more splitting. At the next split, the other node
can be attacked. In other words, the stepwise optimization
makes creating purer nodes at each step desirable. A simple
way to do this is to minimize the entropy at each node for
categorical y. Minimizing the mean square error is a common
choice for continuous y.
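
The difference between the two criteria can be checked numerically. The sketch below assumes the child-node class counts of the example as it is usually given in Breiman et al. (300/100 and 100/300 for one split, 200/400 and 200/0 for the other); those counts are recalled from that book rather than taken from the figure, which is not reproduced here, and the function names are illustrative.

```python
from math import log2


def misclassified(children):
    """Samples misclassified when each child is labelled with its majority class."""
    return sum(sum(counts) - max(counts) for counts in children)


def entropy(counts):
    """Shannon entropy (bits) of the class distribution at one node."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def split_entropy(children):
    """Average child entropy, weighted by the number of samples in each child."""
    n = sum(sum(counts) for counts in children)
    return sum(sum(counts) / n * entropy(counts) for counts in children)


# (class 1 count, class 2 count) in each child; assumed counts, see lead-in.
left_split = [(300, 100), (100, 300)]
right_split = [(200, 400), (200, 0)]

for name, split in (("left", left_split), ("right", right_split)):
    print(name, misclassified(split), round(split_entropy(split), 3))
```

Under these counts both splits misclassify 200 samples, but the split containing the pure node has the lower weighted entropy, so an entropy criterion prefers it.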

The second question (what are the stopping rules?) refers
to when to declare a node terminal. Trees that are too large
may match the training data well, but they won’t necessarily
perform well on new test data, since they have overfit the
data. Thus, a procedure is needed to find an “honest-sized” tree.

Early attempts at this tried to find good stopping rules
based on absolute purity, differential purity from the parent,
and other such “local” evaluations. Unfortunately, good
thresholds for these are hard to find and vary from problem
to problem.

A better choice is as follows: (a) grow an over-large tree
with very conservative stopping rules, (b) form a sequence
of subtrees, $T_0, \ldots, T_n$, ranging from the full tree to just
the root node, (c) estimate an “honest” error rate for each
subtree, and then (d) choose the subtree with the minimum
“honest” error rate.

To form the sequence of subtrees in (b), vary $\alpha$ from 0
(for the full tree) to $\infty$ (for just the root node) in:

$$\min_T \left[ R(T) + \alpha |T| \right],$$

where $R(T)$ is the classification or prediction error for that
subtree and $|T|$ is the number of terminal nodes in the
subtree. This is called the cost-complexity pruning sequence.
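
A minimal sketch of how the penalty selects among subtrees, assuming each candidate subtree is summarised only by its error R(T) and its number of terminal nodes |T|. The four (error, size) pairs are invented for illustration, and a real implementation would search the subtrees of the grown tree rather than a fixed list.

```python
def best_subtree(subtrees, alpha):
    """Pick the subtree minimising R(T) + alpha * |T|.

    `subtrees` is a list of (error R(T), number of terminal nodes |T|) pairs.
    """
    return min(subtrees, key=lambda t: t[0] + alpha * t[1])


# Hypothetical nested subtrees: training error falls as the tree grows.
candidates = [(0.40, 1), (0.20, 3), (0.12, 8), (0.10, 20)]

for alpha in (0.0, 0.01, 0.05, 1.0):
    print(alpha, best_subtree(candidates, alpha))
```

With alpha at 0 the largest tree wins, and as alpha grows the choice moves through progressively smaller subtrees until only the root remains.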

To estimate an “honest” error rate in (c), test the subtrees
on data different from the training data, e.g., grow the
tree on 9/10 of the available data and test on the remaining
1/10, repeating 10 times and averaging. This is often called
cross-validation.
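
The 9/10 versus 1/10 procedure might be sketched as follows. The `fit` and `error` callables are placeholders supplied by the caller, and assigning every tenth sample to the same fold is an assumption made here for simplicity; the text does not say how the folds were drawn.

```python
def cross_validated_error(samples, fit, error, folds=10):
    """Average held-out error over `folds` train/test partitions.

    `fit(train)` returns a model and `error(model, test)` returns its error
    rate on `test`; both are caller-supplied placeholders.
    """
    total = 0.0
    for i in range(folds):
        test = samples[i::folds]  # every folds-th sample, starting at i
        train = [s for j, s in enumerate(samples) if j % folds != i]
        total += error(fit(train), test)
    return total / folds


# Toy usage: the "model" is just the majority label of the training fold.
labels = [0] * 90 + [1] * 10
majority = lambda train: max(set(train), key=train.count)
mismatch = lambda model, test: sum(1 for t in test if t != model) / len(test)
print(cross_validated_error(labels, majority, mismatch))
```

In the tree setting, `fit` would grow and prune a tree on the training folds and `error` would be its misclassification rate on the held-out fold.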

Figure  3  shows  misclassification  rate  vs.  tree  length  for 
the  end-of-sentence  classification  problem  using  a  subset  of 
the input features described below. The bottom curve shows
misclassification  for  the  training  data,  which  continues  to 
improve  with  increasing  tree  length.  The  higher  curve  shows 
the  cross-validated  misclassification  rate,  which  reaches  a 
minimum  with  a  tree  size  of  about  20  and  then  rises  again 
with  increasing  tree  length. 

The last question (what prediction is made at a terminal
node?) is easy to answer. If the predicted variable is
categorical, choose the most frequent class among the training
samples at that node (plurality vote). If it is continuous,
choose the mean of the training samples at that node.
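
Both rules fit in one small function. This is a sketch with a hypothetical name and signature; ties in the plurality vote are broken by whichever class `Counter` returns first, which is an implementation detail and not something the text specifies.

```python
from collections import Counter
from statistics import mean


def leaf_prediction(targets, categorical):
    """Plurality class for categorical y, mean for continuous y."""
    if categorical:
        return Counter(targets).most_common(1)[0][0]
    return mean(targets)


print(leaf_prediction(["eos", "eos", "abbrev"], categorical=True))
print(leaf_prediction([60, 80, 100], categorical=False))
```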

The  approach  described  here  can  be  used  on  quite  large 
problems.  We  have  grown  trees  with  hundreds  of  thousands 
of  samples  with  a  hundred  different  independent  variables. 
The  (expected)  time  complexity,  in  fact,  grows  only  linearly 
with  the  number  of  input  variables  (worst  case  is  quadratic). 
The  one  expensive  operation  is  forming  all  binary  partitions 
for  categorical  x's.  This  increases  exponentially  with  the 
number  of  distinct  values  the  variable  can  assume. 
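
To make the exponential growth concrete: a variable with n distinct values admits 2^(n-1) - 1 splits into two non-empty groups, counting a group and its complement once. The helper below is an illustrative sketch of that standard count, not code from the paper.

```python
def binary_partition_count(n):
    """Number of ways to split n distinct values into two non-empty groups."""
    return 2 ** (n - 1) - 1 if n >= 2 else 0


for n in (4, 12, 50):
    print(n, binary_partition_count(n))
```

Four case values give only 7 candidate splits, whereas a variable with about fifty values is already far beyond exhaustive search.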
Figure 3.


Let  us  now  discuss  in  detail  the  applications  of  these 
ideas  to  some  problems  in  speech  and  language. 

3.  End  of  sentence  detection 

As the first example, let us look again at the end-of-sentence
detection problem described above. A more comprehensive
tree was generated using the following features:

•  Prob[word with “.” occurs at end of sentence]

•  Prob[word after “.” occurs at beginning of sentence]

•  Length  of  word  with  “.” 

•  Length  of  word  after  “.” 

•  Case  of  word  with  “.”:  Upper,  Lower,  Cap,  Numbers 

•  Case  of  word  after  “.”:  Upper,  Lower,  Cap,  Numbers 

•  Punctuation  after  “.”  (if  any) 

•  Abbreviation class of word with “.”:

   - e.g., month name, unit-of-measure, title, address name, etc.

The choice of these features was based on what humans
appear to use (at least when constrained to looking at a
few words around the “.”). Questions such as “Is the word after
the “.” capitalized?”, “Is the word with the “.” a common
abbreviation?”, “Is the word after the “.” likely found at
the beginning of a sentence?”, etc. can be answered with
these features.

The word probabilities indicated above were computed
from the 25 million words of AP news, a much larger (and
independent) text database. (In fact, these probabilities were
for the beginning and end of paragraphs, since these are
explicitly marked in the AP, while ends of sentences, in general,
are not.)

The resulting classification tree correctly identifies
whether a word ending in a “.” is at the end of a declarative
sentence in the Brown corpus with 99.8% accuracy.
The majority of the errors are due to difficult cases, e.g., a
sentence that ends with “Mrs.” or begins with a numeral (it
can happen).

4.  Segment  duration  modelling  for  speech  synthesis 

400 utterances from a single speaker and 4000 utterances
from 400 speakers (the TIMIT database [Fisher, et
al. 1987]) of American English, both of which are
hand-segmented and phonetically labelled, were used separately
to build regression trees that predict the duration of
the phonetic segments. Predicting these durations is important
both in work on speech synthesis and recognition. The
following features were used:

•  Segment Context:

   - Segment to predict
   - Segment to left
   - Segment to right

•  Stress (0, 1, 2)

•  Word Frequency: (rel. 25M AP words)

•  Lexical Position:

   - Segment count from start of word
   - Segment count from end of word
   - Vowel count from start of word
   - Vowel count from end of word

•  Sentence Position:

   - Word count from start of sentence
   - Word count from end of sentence

•  Dialect: N, S, NE, W, SMid, NMid, NYC, Brat

•  Speaking Rate: (rel. to calibration sentences)

Coding the phonetic context required special considerations
since more than 50 phones (using the TIMIT labelling)
can precede a stop in this context. If this were treated as a
single feature, more than $2^{50}$ binary partitions would have to
be considered for this variable at each node, clearly making
this approach impractical. Chou [1987] proposes one solution,
which is to use k-means clustering to find sub-optimal,
but good, partitions in linear complexity.

The solution adopted here is to classify each phone in
terms of 4 features: consonant manner, consonant place,
“vowel manner”, and “vowel place”, each class taking on
about a dozen values. Consonant manner takes on the usual
values such as voiced fricative, unvoiced stop, nasal, etc.
Consonant place takes on values such as bilabial, dental, velar,
etc. “Vowel manner” takes on values such as monophthong,
diphthong, glide, liquid, etc. and “vowel place” takes on
values such as front-low, central-mid-high, back-high, etc. All
can take on the value n/a if they do not apply; e.g., when a
vowel is being represented, consonant manner and place are
assigned n/a. In this way, every segment is decomposed into
four multi-valued features that have acceptable complexity
to the classification scheme and that have some phonetic
justification.

The word frequency was included as a continuously
graded “function word” detector and was based on six
months  of  AP  news  text.  The  stress  was  obtained  from  a 
dictionary  (which  is  easy,  but  imperfect).  The  last  two  fea¬ 
tures  were  used  only  for  the  multi-speaker  database.  The  di¬ 
alect  information  was  coded  with  the  TIMIT  database.  The 
speaking  rate  is  specified  as  the  mean  duration  of  the  two 
calibration  sentences,  which  were  spoken  by  every  speaker. 

Over 70% of the durational variance for the single
speaker and over 60% for the multiple speakers was accounted
for by these trees. Figure 4 shows durations and
duration residuals for all the segments together. The large
tree sizes here, many hundreds of nodes, make them somewhat
uninteresting to display.

These trees were used to derive durations for a text-to-speech
synthesizer. This approach offers a promising alternative
to heuristically derived duration rules [e.g., Klatt
1976]. Since tree building and evaluation are rapid once the
data are collected and the candidate features specified, this
technique can be readily applied to other feature sets and to
other languages.

This  approach  is  very  data-intensive,  though.  Our 
databases  have  tens  or  hundreds  of  thousands  of  segments. 
We  believe  really  good  duration  modelling  will  involve  at 
least  an  order  of  magnitude  more  data.  This  presents  not 
so  much  a  computational  problem,  given  the  efficient  algo¬ 
rithms  for  tree  construction  available,  but  a  data  collection 
problem.  We  believe  that  automatic  transcription  [Ljolje 
and  Riley]  may  ultimately  be  the  way  to  proceed. 

5.  Phoneme-to-phone  prediction 

The task here is, given a phonemic transcription of an
utterance, e.g., based on dictionary lookup, to predict the
phonetic realization produced by a speaker [see also Lucassen,
et al. 1984; Chou, 1987; Riley 1989, 1991; Chen 1990;
Randolph 1990]. For example, when will a T be flapped (as
in American English pronunciation of ’pretty’) or released
(as in phrase-initial T’s)? We used the following features,
extracted from the TIMIT database, to decide this problem:

•  Phonemic  Context: 

—  Phoneme  to  predict 
—  Three  phonemes  to  left 
—  Three  phonemes  to  right 

•  Stress  (0,  1,  2) 

•  Word Frequency: (rel. 25M AP words)

•  Dialect:  N,  S,  NE,  W,  SMid,  NMid,  NYC,  Brat 

•  Lexical  Position: 

—  Phoneme  count  from  start  of  word 
—  Phoneme count from end of word

•  Phonetic  Context:  phone  predicted  to  left 

The phonemic context was coded in a seven-segment window
centered on the phoneme to realize, again using the
4-feature decomposition described above. The other features
are similar to those in the duration prediction problem. Ignore
the last feature for the moment.

The  tree  for  all  phonemes  grown  on  these  features  pre¬ 
dicts  on  the  average  83%  of  the  TIMIT  labellings  exactly.  A 
large  percentage  of  the  errors  are  on  the  precise  labelling  of 
reduced  vowels  as  either  IX  or  AX. 

A list of alternative phonetic realizations can also be produced
from the tree, since the relative frequencies of different
phones appearing at a given terminal node can be retained.
Figure 5 shows such a listing for the utterance “Would your
name be Tom?” (We use the TIMITBET phonetic symbols
in these examples [Fisher, et al. 1987].) It indicates,
for example, that the D in “would” is most likely uttered
as a DCL JH in this context (59% of the time), followed by
DCL D (28%). On the average, five alternatives per phoneme
are sufficient to cover 99% of the possible phonetic realizations.
This can be used, for example, to greatly constrain the
number of alternatives that must be considered in automatic
segmentation when the orthography is known.

Figure 5. Phonetic alternatives for “Would your name be
Tom?”

These a priori probabilities, however, do not take into
account the phonetic context, only the phonemic. For example,
if DCL JH is uttered for the phoneme D in the example
in Figure 5, then the Y is most likely deleted and not
uttered. However, the overall probability that a Y is uttered
in that phonemic context (averaging over D going to DCL
JH, D, etc.) is greatest. The point is that, to incorporate the
fact that “D goes to DCL JH implies Y usually deletes”,
transition probabilities should be taken into account.

This  can  be  done  by  including  an  additional  feature  for 
the  phonetic  identity  of  the  previous  segment.  The  output 
listing  then  becomes  a  transition  matrix  for  each  phoneme. 
The  best  path  through  such  a  lattice  can  be  found  by  dy¬ 
namic  programming. 
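
As a rough illustration of the dynamic programming step, the sketch below finds the most probable phone sequence through a small lattice. The data layout (`lattice[t][prev][cur]`), the start symbol `#`, and the second-step probabilities are assumptions made for the example; only the 59% and 28% figures for D come from the text.

```python
import math


def best_path(lattice, start="#"):
    """Return (phones, probability) for the most probable path.

    lattice[t][prev][cur] is the probability of realizing phoneme t as
    phone `cur` given that the previous phone was `prev`.
    """
    # best[cur] holds (log probability, path) of the best path ending in cur.
    best = {start: (0.0, [])}
    for step in lattice:
        new_best = {}
        for prev, (score, path) in best.items():
            for cur, p in step.get(prev, {}).items():
                if p <= 0.0:
                    continue
                cand = score + math.log(p)
                if cur not in new_best or cand > new_best[cur][0]:
                    new_best[cur] = (cand, path + [cur])
        best = new_best
    score, path = max(best.values(), key=lambda sp: sp[0])
    return path, math.exp(score)


# Hypothetical transition tables for the D and Y of "would your".
lattice = [
    {"#": {"DCL JH": 0.59, "DCL D": 0.28}},
    {"DCL JH": {"(deleted)": 0.8, "Y": 0.2},
     "DCL D": {"Y": 0.9, "(deleted)": 0.1}},
]
path, prob = best_path(lattice)
print(path, round(prob, 3))
```

Because only the best path into each phone is kept at every step, the cost grows linearly with the number of phonemes rather than with the number of complete paths.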

This, coupled with a dictionary, can also be used for
letter-to-sound rules for a synthesizer (when the entry is
present in the dictionary). The effect of using the TIMIT
database for this purpose is a somewhat folksy sounding
synthesizer. Having the D in “Would your” uttered as a JH may
be appropriate for fluent English, but it sounds a bit forced
with existing synthesizers. Too much else is wrong. A very
carefully uttered database by a professional speaker would
give better results for this application of the phoneme-to-phone
tree.

6.  Discussion 

On the whole, we have found classification and regression
trees quite useful in modelling a variety of phenomena
in speech and language. In part, it is their ability to handle
both categorical and continuous inputs and outputs that
makes them attractive to us. The fact that they offer efficient
algorithms, a well-established cross-validation procedure,
and a relatively perspicuous representation makes them
more appealing to us than, say, back-propagation neural
networks for the problems we have described.

The  principal  difficulty  we  have  found  with  this  and  sim¬ 
ilar  statistical  approaches  is  that  while  the  trees  classify  well 
most of the time, they occasionally make egregious errors.
When  noticed,  it  is  possible  to  correct  these  errors  by  hand 
modification  of  the  trees.  This  is,  however,  quite  tedious. 
Further,  if  new  data  are  used  or  new  input  features  are  tried, 
the  editing  has  to  be  redone  (if  the  error  remains). 

What would be most appealing to us would be techniques
that would allow easy mixing of statistical learning
with hand specification. The user could hand specify what
he is sure of and leave it to the statistics to fill in the rest as
best they can, letting us have our cake and eat it too.

Abstract

Text analysis is a hot topic, and for good reason.
Text  is  more  available  than  ever  before.  Just  ten  years 
ago,  the  one-million  word  Brown  Corpus  (Francis  and 
Kucera,  1982)  was  still  considered  large,  but  even  then, 
there  were  much  larger  corpora  in  use  such  as  the  18 
million  word  Birmingham  Corpus  (Sinclair  1987a, 
1987b).  These  days,  there  are  many  places  that  regu¬ 
larly  use  samples  of  text  running  into  the  hundreds  of 
millions  of  words.  And  it  is  very  likely  that  billions  of 
words  will  be  available  very  soon. 

All of this data provides a great research opportunity;
corpus data can be used much more effectively
these days than in the 1950s, the last time that
empiricism was in fashion. Text analysis focuses on
broad (though possibly superficial) coverage of unrestricted
text, rather than a deep analysis of a restricted
domain. This pragmatic view toward coverage and
performance distinguishes text analysis from so-called
“intelligent” approaches such as natural language
understanding. This approach has produced a number of
tools such as spelling correctors and part of speech
taggers that work on unrestricted text, with reasonable
accuracy and efficiency.

1. Recognition Applications

Recognition applications are perhaps the most
obvious applications for large bodies of text. Three
examples  of  recognition  applications  will  be  mentioned 
here:  (1)  Speech  Recognition,  (2)  Optical  Character 
Recognition  (OCR),  and  (3)  Spelling  Correction. 

Imagine a noisy channel, such as a speech recognition
machine that almost hears, an optical character
recognition (OCR) machine that almost reads, or a typist
that almost types. Good text ($W_i$) goes into the channel,
and corrupted text ($W_o$) comes out the other end.

$$W_i \longrightarrow \text{Noisy Channel} \longrightarrow W_o$$

How can an automatic procedure recover the good input
text, $W_i$, from the corrupted output, $W_o$? In principle,
one can recover the most likely input by hypothesizing
all possible input texts and selecting the input text
with the highest score. Using a classic Bayesian argument,
the score is computed by taking the product of the
prior probability, $\Pr(W_i)$, and the channel probability,
$\Pr(W_o \mid W_i)$. This procedure can be written as:

$$\operatorname*{ARGMAX}_{W_i} \; \Pr(W_i)\,\Pr(W_o \mid W_i)$$

where ARGMAX finds the argument with the maximum
score.
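
The selection rule can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular recognizer: the function name, the table layout, and every probability below are invented for the example.

```python
def recover(observed, candidates, prior, channel):
    """Pick the candidate input w maximizing prior[w] * channel[(observed, w)]."""
    return max(candidates,
               key=lambda w: prior[w] * channel.get((observed, w), 0.0))


# Hypothetical numbers: the channel emitted "rider".
prior = {"writer": 4e-5, "rider": 1e-5}              # Pr(W_i)
channel = {("rider", "writer"): 0.3,                 # Pr(W_o | W_i)
           ("rider", "rider"): 0.7}
print(recover("rider", ["writer", "rider"], prior, channel))
```

With these made-up values the prior outweighs the channel, so the less faithful but more frequent input wins. This is the trade-off the product of the two probabilities is meant to capture.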

The prior probability, also known as the language
model, is the probability that $W_i$ would be input to
the channel. For example, in the speech recognition
application, it is the probability that someone would
utter $W_i$, whereas in the spelling correction application,
it is the probability that someone would type $W_i$. In
practice, the prior is approximated by computing various
statistics over a large sample of text.

The channel probability is the probability that the
channel would transform the word sequence $W_i$ into the
sequence $W_o$. This is relatively high if $W_i$ is the same
as or very “similar” to $W_o$, where the definition of
“similar” depends on the application. The channel for
speech recognition, for example, will have a high probability
of mapping words that sound similar (e.g.,
“writer” and “rider” in many American dialects) into
the same output representation. However, in other
applications such as optical character recognition,
“writer” and “rider” are unlikely to be confused by
the channel because these words are optically quite distinct.
Thus, the channel model clearly depends on the
application, as illustrated in Table 1.

Table 1: Examples of Channel Confusions
in Different Applications

Application          Input        Output
Speech               writer       rider
Recognition          here         hear
Optical              all          a1l (A-one-L)
Character            of           o{
Recognition          form         farm
Spelling             government   goverment
Correction           occurred     occured
                     commercial   commerical

It  is  convenient  to  partition  the  prior  and  the  chan¬ 
nel  in  this  way,  so  that  the  same  prior  can  be  used  for  a 
variety  of  recognition  applications  including  speech 
recognition,  optical  character  recognition  and  spelling 
correction. The channel, of course, generally cannot be
ported from one application to another.

2. Spelling Correction

I have found that spelling correction is a good
application to look at because it is analogous to many
important recognition applications based on a noisy
channel model (such as speech recognition), though
somewhat simpler and therefore possibly more amenable
to detailed statistical analysis. In (Kernighan, Church,
and Gale, 1990), we described a program called correct
which inputs a misspelled word such as absurb, and outputs
a list of candidate corrections sorted by probability:
absorb (56%), absurd (44%). The probability scores are
the novel contribution; there have been many programs
in the past that generated a (long) list of candidate
corrections, but few have attempted to score the
candidates by a stochastic model of the prior probability
of observing the candidate correction $\Pr(c)$ and a channel
probability of observing a particular typo given the
candidate correction $\Pr(t \mid c)$. Both of these probabilities
were estimated from about 50 million words of
Associated Press newswire (which includes about 15,000
typos which are used to train the channel model).

In evaluating the program, we restricted our attention
to 564 typos that had exactly two candidate corrections.
A panel of three judges were given the typo (e.g.,
absurb), the two candidate corrections (e.g., absorb and
absurd) and a concordance line (e.g., it is absurb and
probably obscene for...), and were asked to select one of
the two corrections (or none-of-the-above). The judges
found this task more difficult than they had anticipated,
and very time consuming (it took each judge about four
hours to grade the 564 examples). In addition, the
judges felt that the task would have been much harder
without the concordance line, suggesting that context
should be incorporated into the program.

Table 2 shows that correct agrees with the majority
of the judges in 87% of the 332 cases of interest.¹ In
order to help calibrate this result, we compared correct
to three inferior methods: channel-only, prior-only and
chance. Table 2 shows that both the channel-only and
the prior-only models provide a significant contribution
over chance, and that correct, which is a combination of
the two, is significantly better than either in isolation.

¹ We restricted our attention to those cases where at least two
judges selected one of the two candidate corrections, and they
agreed with each other.


Table  2  also  shows  that  the  judges  are  significantly 
better  than  all  of  the  programs,  indicating  that  there  is 
room  for  improvement. 


Table 2: Evaluation of correct

Method         Discrimination   %
correct        286/329          87 ± 1.9
Judge 1        271/273          99 ± 0.5
Judge 2        271/275          99 ± 0.7
Judge 3        271/281          96 ± 1.1
channel-only   263/329          80 ± 2.2
prior-only     247/329          75 ± 2.4
chance         172/329          52 ± 2.8
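
The text does not say how the ± figures in Table 2 were obtained. They are consistent with a binomial standard error, sqrt(p(1-p)/n), and the sketch below recomputes them under that assumption. The function name is illustrative.

```python
import math


def percent_with_se(successes, trials):
    """Return (percent correct, standard error in percent), assuming a binomial model."""
    p = successes / trials
    se = math.sqrt(p * (1.0 - p) / trials)
    return round(100 * p), round(100 * se, 1)


# Counts taken from the 329-case rows of Table 2.
for name, k in [("correct", 286), ("channel-only", 263),
                ("prior-only", 247), ("chance", 172)]:
    print(name, percent_with_se(k, 329))
```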

The program, of course, is not making use of context,
whereas the human judges did have access to a concordance
line. The following examples show that the
task is extremely difficult without context.

Table 3: Hard without Context

Typo         Choice 1     Choice 2
actuall      actual       actually
constuming   consuming    costuming
conviced     convicted    convinced
confusin     confusing    confusion

Of  course,  the  task  becomes  much  easier  if  the  context 
is  provided  as  demonstrated  by  the  following  four  con¬ 
cordance  lines. 

1.  in determining whether the defendant actuall will
die. In the 1985 decision, the...

2.  on Friday night, a show as lavish in constuming
and lighting as those the late Liberace used to...

3.  of the area. “When we’re conviced and the Peruvians
are convinced (the base camp)...

4.  The political situation grew more confusin today,
with an official media report indicating...

Both (Mays et al., 1990) and (Church and Gale,
1991a) have found that statistical n-gram models of context
can help considerably, although performance is still
far below that of the human judges. A quick look at the
concordance lines above shows (a) that the relevant contextual
clues are often fairly close to the typo, and (b)
that there are relatively few cases that make use of
long-distance syntactic dependencies. Observation (a) suggests
that simple n-gram methods might work fairly well in many
cases, and (b) suggests that more complicated “intelligent”
parsing methods might not be worth the trouble.

3. The Trigram Model

One of the simpler and more popular priors is the
n-gram model. This model makes the simplifying
assumption that word probabilities depend on only the
previous n-1 words, and that long-distance dependences
which extend beyond this limited window can be
ignored. Jelinek (1985) uses the example shown in
Table 4 to illustrate the power of the trigram model. In
the sentence, We need to resolve all the important issues
within the next two days, most of the words are
extremely predictable from the trigram context (the
current word plus the previous two). Note that we is the
9th most likely word to begin a sentence in his model;
the words the, this, one, ..., in are more likely to begin a
sentence than we. The word need is found to be the 7th
most likely word to follow we; the words are, will, ...,
do are more likely than need. And so on. Jelinek uses
this example to argue that the rank is usually very small
in comparison to the vocabulary size, which was 20,000
words in this example.
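
The notion of rank used here can be made concrete with a toy model. The sketch below counts trigrams in a tiny invented corpus and reports where a word falls among the continuations of a two-word context. The corpus, the `<s>` padding symbol, and the function names are assumptions for illustration, not Jelinek's setup.

```python
from collections import Counter, defaultdict


def train_trigrams(sentences):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split()
        for i in range(2, len(words)):
            counts[(words[i - 2], words[i - 1])][words[i]] += 1
    return counts


def rank(counts, context, word):
    """1-based rank of `word` among continuations of `context`; None if unseen."""
    ordered = [w for w, _ in counts[context].most_common()]
    return ordered.index(word) + 1 if word in ordered else None


corpus = ["we need to go", "we need to go",
          "we need to resolve it", "we want to go"]
counts = train_trigrams(corpus)
print(rank(counts, ("need", "to"), "resolve"))
```

A word that never followed the context in training has no rank at all, which is the sparse-data problem in miniature.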


Table 4: Example of Trigrams (Jelinek, 1985)

Word        Rank  More likely alternatives
We             9  The This One Two A Three Please In
need           7  are will the would also do
to             1
resolve       98  know have understand ...
all            9  the this these problems ...
the            3  issues problems
important    641  necessary data information ...
issues         9  role thing that ...
within        66  and from in to are with ...
the            1
next           1
two            2  be
days           7  meeting months years ...


Note  that  function  words  (e.g.,  to,  the)  are  gen¬ 
erally  more  predictable  than  content  words  (e.g.,  resolve, 
important).  This  turns  out  to  be  important  in  speech 
recognition  because  the  shorter  function  words  are  more 
easily  confused  by  the  channel  model  and  so  it  is  for¬ 
tunate  that  they  are  more  predictable  from  context. 

Some of the content words also have relatively
small ranks. Consider the content word issues, for
example. It turns out that there are relatively few words
that follow the word important (at least, in the sub-domain
of IBM office correspondences). This kind of
collocational (or co-occurrence)² constraint between
words is often not captured very well with a syntactic
parser. Perhaps this is the reason why trigram models
have tended to out-perform so-called “intelligent”
approaches, when performance is measured in terms of
entropy.

² Halliday (1966, p. 150) was very interested in the difference
between strong and powerful. Although both words have very
similar syntax and semantics, there do seem to be some contexts
where one word is much more appropriate than the other, e.g.,
strong tea vs. powerful drugs. The terms collocation,
co-occurrence and lexis have been used to describe these kinds of
constraints on pairs of words.

4. Word Frequencies and Word Association Norms

The trigram model does a good job of modeling
word frequencies, which are very important, as any
psycholinguist knows. Generally speaking, subjects
respond more quickly and more accurately to a high frequency
word (e.g., a word that appears relatively often
in a sample of text such as the Brown Corpus) than to
an unusual low frequency word. The word association
effect is similar except that it involves pairs of words.
In general, subjects respond more quickly and more
accurately to a word like doctor if it follows a highly
associated word such as nurse (Meyer, Schvaneveldt and
Ruddy, 1975, p. 98).

Word  frequencies  are  fairly  easy  to  estimate  from 
a  sample  of  text  such  as  the  Brown  Corpus.  Hanks  and 
I  have  argued  that  word  associations  should  also  be 
estimated  by  computing  various  statistics  over  large  cor¬ 
pora  (Church  and  Hanks,  1990).  It  is  more  common  in 
the  psycholinguistic  literature  to  find  a  study  like 
(Palermo  and  Jenkins,  1964);  they  estimated  word  asso¬ 
ciation  norms  for  200  words  by  asking  a  few  thousand 
subjects  (psychology  undergraduates)  to  write  down  a 
word  after  each  of  the  words  to  be  measured.  Results 
were  reported  in  tabular  form,  indicating  which  words 
were  written  down,  and  by  how  many  subjects,  factored 
by  grade  level  and  sex.  The  word  doctor,  for  example, 
is  reported  on  pp.  98-100,  to  be  most  often  associated 
with  nurse,  followed  by  sick,  health,  medicine,  hospital, 
man,  sickness,  lawyer,  and  about  70  more  words. 

5.  Strengths  and  Weaknesses 

The  main  advantage  of  the  trigram  model  is  that  it 
has  very  low  entropy,  1.76  bits  per  character  (Brown  et 
al.,  1991).  Parsers  generally  don’t  do  as  well  because 
they  tend  to  ignore  word  frequencies.  The  trigram 
model  is  also  able  to  capture  some  collocations  and 
word  associations. 

The most obvious weakness with the trigram
model is the lack of syntax; the model makes no attempt
to capture long-distance dependencies such as syntactic
agreement, conjunction and wh-movement. In fact, the
lack of syntax is probably not the most serious problem
with the model. The sparse-data problem is extremely
serious since many trigrams do not appear very often in
the training corpus, if at all. In addition, the trigram
model assumes that trigrams have a binomial distribution,
an assumption which is often violated in practice.

6. Parsers May Not Help Very Much

It  has  been  common  practice,  especially  during 
the first DARPA Speech Understanding Project (Klatt,
1977),  to  try  to  use  a  syntactic  parser  to  take  advantage 
of  contextual  constraints.  Unfortunately,  there  has  not 
been  very  much  success.  If  I  tell  you  that  the  next  word 
is  going  to  be  a  noun,  then  I  really  haven’t  told  you 
very  much.  The  following  example  illustrates  the  prob¬ 
lem. 

In the Optical Character Recognition (OCR) application,
it is likely that the words form and farm might
be confused by the channel model. Imagine, for example,
that they were found in one of the following two
contexts:

federal {form | farm} credit
[second context illegible in the source]

Most people would have little trouble deciding that farm
is  much  more  likely  in  the  first  context  and  that  form  is 
much  more  likely  in  the  second  context.  In  fact,  trigram 
models  also  have  little  difficulty  with  this  example. 
However,  a  syntactic  parser  wouldn’t  help  very  much. 
The parser might tell us that the missing word is a noun,
but  that  wouldn’t  help  distinguish  between  form  and 
farm  because  they  are  both  nouns.  In  general,  if  one 
were  to  compare  the  relative  importance  of  local  context 
versus  long-distance  dependencies,  one  would  almost 
certainly  find  that  the  local  context  is  much  more  impor¬ 
tant,  at  least  in  terms  of  predicting  the  next  word. 

The  linguistic  notion  of  syntax  (constraints  on 
nouns,  verbs,  subjects,  objects,  phrases,  etc.)  was  not 
intended  to  be  used  in  a  noisy  channel  model.  Chomsky 
has always been more interested in linguistic competence
(an idealization of syntax) than performance
(deviations that are found in the real world, including
word frequencies, word association norms, collocations,
statistical preferences, memory and computational limitations,
etc.). It should not be surprising that performance
issues are important in recognition applications,
and consequently, models that are based too closely on
idealized notions of syntactic competence are likely to
run into trouble when they are tested on real data.

7.  Entropy 

It is common practice to evaluate a language
model on the basis of its entropy. The standard ASCII
code uses 8 bits to represent a character. Obviously,
many of these bits are unnecessary since some letters
are much more common than others. If one were to
take advantage of letter frequencies using a Huffman
code to encode each letter one at a time, then it would
take about 5 bits to code each character. This very simple
code does almost as well as the Unix(TM) compress
program, which uses the Lempel-Ziv algorithm (Welch,
1984).

In general, models based on words achieve much
better compression than models based on characters. A
unigram model (a Huffman code based on word probabilities)
requires about 2.1 bits per character (Brown,
personal communication). Note that the unigram model
outperforms Lempel-Ziv by a considerable margin,
indicating that the standard Unix(TM) compress program
could be improved significantly.

The trigram model achieves even better compression,
1.76 bits per character (Brown et al., 1991). This
last model is remarkably close to Shannon’s estimate for
the entropy of English. However, it isn’t exactly fair to
compare these estimates since Shannon’s estimate was
based on a 27 character alphabet whereas these other
estimates are based on a 256 character alphabet.
Nevertheless there does seem to be some reason to
believe that the trigram model is doing quite well, and
that it might be almost as good as native speakers in
predicting the next letter.

Table 5: Entropy of Various Language Models

Model                            Bits / char
ASCII                            8
Huffman code each char           5
Lempel-Ziv (Unix(TM) compress)   4.43
Unigram                          2.1
Trigram                          1.76
Shannon’s Estimate               1.25
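
For readers who want to reproduce the character-level end of this comparison, the sketch below computes the zeroth-order entropy of a text, i.e. the entropy of its single-character distribution. The average length of a character-by-character Huffman code lies within one bit of this value. The function name and the sample strings are illustrative only, and real figures depend on the corpus used.

```python
import math
from collections import Counter


def bits_per_char(text):
    """Entropy, in bits, of the empirical single-character distribution of `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


print(bits_per_char("abab"))   # two equally likely symbols
print(bits_per_char("abcd"))   # four equally likely symbols
```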


8. Sparse Data “Fixes”

As mentioned above, the sparse data problem is
probably the most serious weakness with the trigram
model. In fact, there are usually many more parameters
than data points. Let $V$ be the number of types in the
vocabulary and $N$ be the number of tokens in the
corpus. Then there are $V^3$ parameters, which is generally
much, much larger than $N$, the size of the training
set. For example, in the Brown Corpus, there are
$V^3 \approx 1.25 \times 10^{14}$ trigrams, and only $N \approx 10^6$ tokens
to train from. Obviously, most of the possible trigrams
will not be observed in the training corpus.
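
The arithmetic behind this comparison is easy to check. In the sketch below, the vocabulary size of 50,000 types is an assumption, chosen because its cube matches the trigram count quoted for the Brown Corpus; the token count is the corpus's nominal one million words.

```python
V = 50_000        # assumed number of word types
N = 1_000_000     # approximate number of tokens in the Brown Corpus

possible_trigrams = V ** 3
trigrams_per_token = possible_trigrams // N

print(possible_trigrams)      # distinct trigrams the model must cover
print(trigrams_per_token)     # possible trigrams per training token
```

Even if every token in the corpus started a different trigram, only a vanishing fraction of the parameter space could be observed.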

One might think that one could fix the sparse data
problem by collecting more data, but ironically $V^3$ generally
grows much faster than $N$. That is, if you collect
a larger corpus (more tokens), then you will also find
more types (vocabulary items). It isn’t exactly clear
how these two functions grow, but I believe that the
vocabulary grows almost linearly with corpus size. In
any case, $V^3$ grows much, much faster than $N$, so collecting
more data is not a solution to the sparse data
problem.

Something  has  to  be  done  about  the  sparse  data. 
Katz (1987) suggests “backing-off” from the trigram
estimates  when  there  isn’t  enough  data.  Basically,  the 
idea  is  to  replace  trigram  estimates  with  a  combination 
of  unigram,  bigram  and  trigram  estimates.  This  is  obvi¬ 
ously  a  good  idea. 
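
One simple way to combine the three estimates is linear interpolation with fixed weights, sketched below. This is not the discounted back-off scheme cited above; it is a simpler stand-in that shows the same idea of falling back on lower-order statistics. The weights, function names, and toy corpus are all assumptions.

```python
from collections import Counter


def train(tokens):
    """Return unigram, bigram and trigram counts for a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def interpolated(w1, w2, w3, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """Estimate Pr(w3 | w1, w2) as a weighted mix of three relative frequencies."""
    total = sum(uni.values())
    p1 = uni[w3] / total if total else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    l1, l2, l3 = lambdas
    return l1 * p1 + l2 * p2 + l3 * p3


tokens = "the cat sat on the mat the cat ran".split()
uni, bi, tri = train(tokens)
seen = interpolated("the", "cat", "sat", uni, bi, tri)
unseen = interpolated("on", "the", "cat", uni, bi, tri)
print(seen, unseen)
```

The trigram "on the cat" never occurs in the toy corpus, yet it still receives a non-zero estimate from the bigram and unigram terms.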

One can also try to reduce the number of parameters
by grouping words into classes (e.g., parts of
speech, synonym sets, etc.). Brown et al. (1990b) suggest
building classes with a self-organizing procedure
which joins words based on a mutual information criterion.
The criterion has the effect of joining together
words that have similar distributions (e.g., days of the
week, months of the year, etc.). Although this particular
suggestion is very intriguing, it probably won’t help too
much with the sparse data problem because it isn’t possible
to determine that two words have a similar distribution
unless you have a fair number of examples of
both words. The real problem is what to do with words
that you haven’t seen very often in the training set.
Worse, what do you do with words that you haven’t
seen at all? The criterion for joining words cannot
depend on data that is unavailable.

9. MLE, ADD1, GT and HO

Finally, one can “adjust” frequency counts, especially
when they are small. In principle, n-gram probabilities
can be estimated from a large sample of text by
counting the number of occurrences of each n-gram of
interest and dividing by the size of the training sample.
This method, which is known as the “Maximum Likelihood
Estimator” (MLE), is very simple. However, it is
unsuitable because n-grams which do not occur in the
training sample are assigned zero probability. This is
qualitatively wrong for use as a prior model, because it
would never allow the n-gram, while clearly some of the
unseen n-grams will occur in other texts. For non-zero
frequencies, the MLE is quantitatively wrong.

Three alternatives will be mentioned here. These
methods all take the observed counts ($r$) and produce an
adjusted count ($r^*$). The last two methods also make
use of $N_r$, the number of types that occur exactly $r$
times.


    r* = r                            (MLE)

    r* = (r + 1) N / (N + S)          (ADD1)

    r* = (r + 1) N_{r+1} / N_r        (GT)

    r* = C_r / N_r                    (HO)

The first method, ADD1 (Jeffreys, 1948), simply
adds one to all of the observed counts and then adjusts
the total appropriately by multiplying by N/(N+S),
where S is the number of types (e.g., V^2). This method
is generally a disaster, especially when S is much larger
than N, which is most of the time. In a spelling correction
application, Gale and I have found that this method
produced very misleading estimates and concluded that
estimating the context badly can be worse than not
estimating the context at all (Church and Gale, 1990).
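
To make the problem with ADD1 concrete, here is a minimal sketch in Python. The function names and the sizes N and S are hypothetical, chosen only for illustration. Since at most N distinct types can be seen in N tokens, at least S - N types are unseen, and each of them receives the adjusted count 0* = N/(N+S).

```python
def add1_adjusted_count(r, n_tokens, n_types):
    """ADD1 adjusted count r* = (r + 1) * N / (N + S)."""
    return (r + 1) * n_tokens / (n_tokens + n_types)


def unseen_mass_lower_bound(n_tokens, n_types):
    """Lower bound on the share of probability mass ADD1 gives to unseen types.

    At most N distinct types can appear in N tokens, so at least S - N types
    are unseen, and each one receives the adjusted count 0* = N / (N + S).
    """
    unseen_types = max(n_types - n_tokens, 0)
    return unseen_types * add1_adjusted_count(0, n_tokens, n_types) / n_tokens


# Hypothetical sizes (assumptions, not figures from the text): S >> N.
N = 1_000_000          # size of the training sample
S = 10_000_000_000     # number of possible types
print(add1_adjusted_count(1, N, S))    # adjusted count of a type seen once
print(unseen_mass_lower_bound(N, S))   # share of mass spent on unseen types
```

With S this much larger than N, nearly all of the probability mass is handed to types that were never observed, and a type seen once is discounted almost to zero.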

The second method, GT (Good, 1953), depends
only on the modest assumption that n-grams have binomial
distributions. Unfortunately, even this modest
assumption turns out to be highly problematic. Words
and n-grams are like busses in New York City; they are
social animals and like to travel in packs. The word
earthquake, for example, has a very bursty distribution
in the Associated Press (AP) Newswire, depending on
whether or not there has recently been an earthquake.
The word turkey also has a bursty distribution in the
AP, with a burst appearing once a year in late
November. In fact, one can show that the binomial
assumption is often seriously off depending on what
happens to be in the news, among other things.
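
As a small sketch of how the Good-Turing adjustment is computed, the following Python applies the standard formula r* = (r + 1) N_{r+1} / N_r to the N_r column of Table 6. The function name is hypothetical; the counts are copied from the table. No value is produced for r = 4 because N_5 is not listed there.

```python
def good_turing_adjusted_counts(freq_of_freq):
    """Map r -> r* = (r + 1) * N_{r+1} / N_r wherever N_r and N_{r+1} are known."""
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1)
        if n_next is not None and n_r > 0:
            adjusted[r] = (r + 1) * n_next / n_r
    return adjusted


# N_r for AP bigrams, copied from Table 6.
N_R = {0: 74_671_100_000, 1: 2_018_046, 2: 449_721, 3: 188_933, 4: 105_668}
GT = good_turing_adjusted_counts(N_R)
for r in sorted(GT):
    print(r, GT[r])
```

The values for r = 0 through 3 agree with the r* column of Table 6 to the precision shown there.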

The last method, the held-out (HO) estimate (Jelinek
and Mercer, 1985), assumes the least, merely that the
training and test corpora are generated by the same
process. This method splits the text into two halves and
uses the first half to determine N_r, the number of types
that occur r times, and the second half to determine
their total mass C_r. r* is then simply set to C_r/N_r.
For example, to determine 0*, the adjusted count for
n-grams that did not occur in the first half, one would
compute C_0, the total count in the second half for
n-grams that did not appear in the first half, and divide
by N_0, the number of n-gram types that did not appear in
the first half.
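
The held-out procedure can be sketched in a few lines of Python. This is only an illustration on toy data: the function name is hypothetical, and the number of possible types (needed to obtain N_0) is passed in as an assumed parameter.

```python
from collections import Counter


def held_out_adjusted_counts(first_half, second_half, num_possible_types):
    """Return {r: C_r / N_r}, with N_r from the first half and C_r from the second."""
    first = Counter(first_half)
    second = Counter(second_half)

    n_r = Counter(first.values())                 # N_r for r >= 1
    n_r[0] = num_possible_types - len(first)      # N_0: types unseen in the first half

    c_r = Counter()
    for ngram, count in second.items():
        # Add the second-half count to the bucket of the first-half frequency r.
        c_r[first.get(ngram, 0)] += count

    return {r: c_r[r] / n for r, n in n_r.items() if n > 0}


# Toy data (hypothetical): each string stands for one bigram token.
first_half = ["a b", "a b", "c d"]
second_half = ["a b", "c d", "c d", "e f"]
print(held_out_adjusted_counts(first_half, second_half, num_possible_types=10))
```

Here "e f" never occurs in the first half, so its single second-half occurrence is the whole of C_0, which is shared among the N_0 = 8 unseen types.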

In  (Church  and  Gale,  1991b),  we  compared  the 
GT  and  HO  methods  for  estimating  bigram  frequencies 
in  22  million  words  of  Associated  Press  Newswire  and 
found  that  the  GT  method  was  slightly  better  when  the 
binomial  assumption  was  appropriate.  Tables  6  and  7 
show that both methods produce remarkably similar
estimates for r*.


Table 6: Good-Turing (GT) Estimate

    r    N_r               r*
    0    74,671,100,000    0.0000270
    1    2,018,046         0.446
    2    449,721           1.26
    3    188,933           2.24
    4    105,668           3.24

Table 7: Held-Out (HO) Estimate

    r    N_r               C_r          r*
    0    74,671,100,000    2,019,187    0.0000270
    1    2,018,046         903,206      0.448
    2    449,721           564,153      1.25
    3    188,933           424,015      2.24
    4    105,668           341,099      3.23


The agreement of the two methods, though, is
partly due to the fact that we took extraordinary
measures to control for the New York City bus effect. That
is, we split the text into two samples by randomly
assigning each bigram to one of the two samples. This
effectively destroyed any time structure that might have
existed in the two samples. If we had split the text into
two halves sequentially by assigning the first six months
of the newswire to the first half and the second six months
to the second half, then we would have observed
significant differences due to the non-binomial nature of
the news.
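
The difference between the two ways of splitting can be illustrated with a short Python sketch. The helper names and the toy token stream are hypothetical; the stream is deliberately bursty, with one word occurring only in its first half.

```python
import random


def split_randomly(items, seed=0):
    """Assign each item independently to one of two samples."""
    rng = random.Random(seed)
    halves = ([], [])
    for item in items:
        halves[rng.randrange(2)].append(item)
    return halves


def split_sequentially(items):
    """Put the first half of the stream in one sample, the rest in the other."""
    mid = len(items) // 2
    return items[:mid], items[mid:]


# A bursty toy stream: "earthquake" is only in the news early on.
stream = ["earthquake"] * 100 + ["turkey"] * 100

for name, (a, b) in [("random", split_randomly(stream)),
                     ("sequential", split_sequentially(stream))]:
    print(name, a.count("earthquake"), b.count("earthquake"))
```

Under the sequential split the second sample contains no instance of the bursty word at all, while the random split spreads its occurrences over both samples, which is what hides the time structure.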

Table 8 shows that there is considerable agreement
when the text is split randomly. The t-scores are
possibly somewhat larger than we would like, but they
are really not too bad considering that we are dealing
with extremely infrequent events. The t-scores are
computed using an estimate of variances which is described
in (Church and Gale, 1991b). Table 9 shows that there
is considerable disagreement if the texts are split
sequentially.


Table 8: Split Text Randomly

    r    HO            GT            t
    0    .000027041    .000027026    -.7
    1    .4476         .4457         -2.9
    2    1.254         1.260         2.5
    3    2.244         2.237         -1.5
    4    3.228         3.236         1.0
    5    4.21          4.23          1.8
    6    5.23          5.19          -2.8

Table 9: Split Text Sequentially

    r    HO            GT           t
    0    0.00001684    0.0001132    479.4
    1    0.4076        0.5259       113.
    2    1.0721        1.2378       47.0
    3    1.9742        2.2685       37.8
    4    2.8632        3.1868       26.4
    5    3.7982        4.2180       25.8
    6    4.7822        5.2221       15.4

In summary, there are quite a number of very
powerful techniques such as GT and HO for estimating
the probability of an n-gram that did not appear very
many times in the training corpus, if at all. These
methods appear to work remarkably well when the
assumptions are met, but unfortunately, there are serious
problems with the assumptions. There has recently been
some interest in adaptive models, models that can take
advantage of recency effects and forgetting effects. If
words were binomially distributed, then the probability
of a word should be independent of how long it has
been since it was last mentioned. In the AP wire, it
appears that the probability increases dramatically when
a word has been mentioned recently, and drops fairly
consistently with the length of time since the last
mention.

10. Translation Applications

Section 1 discussed the use of noisy channel
methods in recognition applications. This section will
show how the same methods can be used to address
translation applications such as Machine Translation
(MT). The approach was first suggested by Weaver in
1949 and is currently being revived by Brown et al.
(1990a). If you would like to translate words in a
source language, W_s (e.g., French), into words in a target
language, W_t (e.g., English), you imagine that the
source words W_s were the output of a noisy channel.
The translation task is to find the most likely input to
the noisy channel given the observed outputs.

    W_t → Noisy Channel → W_s

Viewed in this way, translation is very similar to
recognition. In principle, one can recover the most likely
input by hypothesizing all possible target language texts,
W_t, and selecting the target text with the highest score,
where scores are computed by basically the same formula
as above:

    ARGMAX_{W_t} Pr(W_t) Pr(W_s | W_t)
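
The selection step can be sketched as follows in Python. Everything here is a toy assumption: the function name, the candidate list, and the two probability tables are invented for illustration, whereas a real system would estimate the prior and the channel model from large bodies of text and could not enumerate all candidates explicitly.

```python
import math


def decode(observed, candidates, prior, channel):
    """Return the candidate input maximizing Pr(input) * Pr(observed | input)."""
    def log_score(candidate):
        p = prior.get(candidate, 0.0) * channel.get((observed, candidate), 0.0)
        return math.log(p) if p > 0 else float("-inf")
    return max(candidates, key=log_score)


# Hypothetical toy probabilities (not estimated from any corpus).
PRIOR = {"the house": 0.60, "the home": 0.39, "house the": 0.01}   # Pr(W_t)
CHANNEL = {                                                        # Pr(W_s | W_t)
    ("la maison", "the house"): 0.5,
    ("la maison", "the home"): 0.3,
    ("la maison", "house the"): 0.5,
}

best = decode("la maison", list(PRIOR), PRIOR, CHANNEL)
print(best)
```

The prior penalizes the ill-formed word order even though the channel model alone cannot tell the first and third candidates apart.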

This information theoretic approach to machine
translation is extremely controversial among researchers
in machine translation because it questions many of the
basic assumptions that have dominated the field since
the 1950s when Chomsky (1957) and others pointed out
that statistical n-gram methods are incapable of modeling
certain syntactic constraints such as agreement over
long distances. Brown et al. (1990a) argue that the
statistical approach is more tractable than it was in the
1950s. Computers are certainly faster than they were
then. In addition, and probably much more importantly,
it is now possible to find large amounts of parallel text,
text such as the Canadian parliamentary debates which
are available in multiple languages. Brown et al.
estimate Pr(W_t) and Pr(W_s | W_t) by computing various
statistics over these parallel texts. Although the
approach may be deeply flawed for many of the reasons
that were discussed in the 1950s, there is, nevertheless,
a growing community of researchers in corpus-based
linguistics such as (Klavans and Tzoukermann, 1990)
who are becoming convinced that the approach is worth
pursuing because there is a very good chance that it will
produce a number of lexical resources that could be of
great value to their research.

11. Part of Speech Tagging

This description of the machine translation problem
is fairly general and can be applied to quite a
number of transduction problems. Consider part of
speech tagging, for example. A part of speech tagger
takes an input sequence of words such as The table is
ready, and outputs a sequence of parts of speech such
as: Article Noun Verb Adjective. The problem is
non-trivial because it is well-known that part of speech
depends on context. The word “table,” for example, is
usually a noun, but it can also be a verb in some
contexts such as: The chairman will table the motion.

The tagging problem can be viewed as a translation
problem, not unlike machine translation. Imagine
that we have a sequence of parts of speech P that go
into the channel and produce a sequence of words W.
Our job is to try to determine the hidden parts of speech
P given the observed words W.

    P → Noisy Channel → W

As  before,  in  principle,  one  can  hypothesize  all  possible 
inputs  to  the  channel  and  score  them  by: 

    ARGMAX_P Pr(P) Pr(W | P)

Again, the parameters in this model are generally
estimated by computing various statistics over large
bodies of text. Both Church (1988) and DeRose (1988)
have used the Tagged Brown Corpus (Francis and
Kucera, 1982) for this purpose, which is particularly
convenient because it comes with parts of speech that
were checked by hand. deMarcken (1990) used the
Tagged Lancaster/Oslo-Bergen Corpus (LOB) which
also comes with parts of speech. Others such as Jelinek
(1985) have used the Baum-Welch Algorithm (Baum,
1972) to estimate the parameters from raw untagged
text.
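
As a rough sketch of the whole pipeline, the Python below estimates Pr(P) as a tag-bigram model and Pr(W | P) as per-tag word probabilities from a tiny hand-tagged sample, and then searches for the best tag sequence with a Viterbi-style dynamic program rather than enumerating all inputs. The three tagged sentences, the "Modal" tag, and all function names are assumptions made for illustration; the estimates are unsmoothed relative frequencies, which a real tagger would need to adjust along the lines of Section 9.

```python
from collections import Counter, defaultdict

# A tiny hand-tagged sample (hypothetical; a real tagger would use a tagged corpus).
TAGGED = [
    [("the", "Article"), ("table", "Noun"), ("is", "Verb"), ("ready", "Adjective")],
    [("the", "Article"), ("chairman", "Noun"), ("will", "Modal"),
     ("table", "Verb"), ("the", "Article"), ("motion", "Noun")],
    [("the", "Article"), ("motion", "Noun"), ("is", "Verb"), ("ready", "Adjective")],
]


def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)   # trans[prev_tag][tag]; "<s>" marks sentence start
    emit = defaultdict(Counter)    # emit[tag][word]
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def prob(counter, key):
    """Relative frequency of key within a Counter (0.0 if the Counter is empty)."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0


def tag(words, trans, emit):
    """Find the tag sequence P maximizing Pr(P) * Pr(W | P)."""
    tags = list(emit)
    # best[t] = (probability, path) of the best tag path ending in tag t
    best = {t: (prob(trans["<s>"], t) * prob(emit[t], words[0]), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            p, path = max(
                ((best[s][0] * prob(trans[s], t), best[s][1]) for s in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (p * prob(emit[t], word), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


trans, emit = train(TAGGED)
print(tag("the table is ready".split(), trans, emit))
print(tag("the chairman will table the motion".split(), trans, emit))
```

On this toy sample, "table" is tagged as a noun in the first sentence and as a verb in the second, purely because of the surrounding tags.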

I have always felt that hand-tagged text produces
more reliable estimates, and recently Merialdo (1990)
performed an experiment which seems to back up my
suspicion. He estimated the parameters using some
hand-tagged data and then ran the re-estimation
procedure and compared performance before and after
re-estimation. One might have thought that re-estimation
ought to improve performance, but he found just the
opposite. He concludes that one should use as much
tagged text as possible to estimate the parameters, and
one should resort to re-estimation only when it is not
possible to find a sufficient amount of tagged training
material.

There are, of course, many other “translation”
applications that are very analogous to machine
translation and part of speech tagging where one wants to
transduce one tape of symbols into another. In speech
recognition, for example, it is common to use these
noisy-channel methods to translate a sequence of acoustic
labels (e.g., the output of a filter bank) into a
sequence of phonetic labels (e.g., consonants and
vowels).

12. Conclusions

Quite a number of applications have been mentioned
in just a few pages: spelling correction, speech
recognition, optical character recognition, text
compression, machine translation and part of speech tagging. Of
course,  there  are  many  other  applications  that  should 
have  been  discussed,  especially  information  retrieval 
(Salton,  1989)  and  author  identification  (Mosteller  and 
Wallace,  1964),  but  there  just  wasn’t  enough  space  to 
say  everything. 

All of this work points very strongly to the fact
that 1950s-style empiricism is back in fashion. I have
been asked to explain why, and I’m not sure that I have
a good answer. Of course, it is possible that the current
interest in empiricism is just a fad that will soon fade
away. But, I would like to believe that there are good
reasons for the revival. One can point to huge advances
in computational power since the 1950s. But, even
more importantly, the electronic culture has now
permeated the publishing sector to such an extent that it is
no longer difficult to find hundreds of millions of words
of text in electronic form. And there is promise of
billions of words in the very near future. The availability
of data on such a massive scale has made it possible to
carry out experiments that just weren’t possible back in
the 1950s. Indeed, many of the experiments discussed
in this paper would not have been possible without the
availability of very large corpora.

Abstract

Adaptive resonance theory (ART) neural networks are
being developed for application to the industrial
engineering problem of group technology: the reuse of
engineering designs. Two and three dimensional
representations of engineering designs are input to ART-1
neural networks to produce groups or families of
similar  parts.  These  representations,  in  their  basic  form, 
amount  to  bit  maps  of  the  design,  and  can  become  very 
large  when  the  design  is  represented  in  high  resolution. 
We  describe  a  "neural  database"  system  under 
development.  This  system  demonstrates  the  feasibility 
of  training  an  ART-1  network  to  first  cluster  designs 
into  families,  and  then  to  recall  the  family  when 
presented  a  similar  design.  This  application  is  of  large 
practical  value  to  industry,  making  it  possible  to  avoid 
duplication  of  design  efforts. 

Introduction 

Money  and  time  can  be  saved  by  manufacturing 
companies  when  engineering  designs  are  reused.  This  is 
particularly  true  in  companies  producing  large  systems, 
such  as  aircraft,  that  must  be  customized  to  varying 
layouts. Often the same design is inadvertently
redesigned at great expense. This can happen frequently
in  large  systems  which  involve  teams  of  designers.  A 
new  designer  will  have  no  knowledge  of  a  previous 
designer's  work  unless  the  technology  exists  to  retrieve 
and  compare  designs.  In  industrial  engineering,  the 
study  and  implementation  of  such  retrieval  systems  is 
referred  to  as  group  technology. 

Several  basic  requirements  must  be  met  for  the  practical 
implementation  of  group  technology.  First,  the  designs 
must  exist  in,  or  be  convertible  to,  an  electronic 
description.  Second,  an  appropriate  criterion  must  be 
designed  to  determine  similarity  of  designs.  Third,  the 
search  algorithm  must  exceed  a  threshold  of  performance 
on  the  host  computer  to  provide  timely  responses  for  the 
user.  Fourth,  a  retrieval  system  should  output  the  best 
few  matches  for  consideration  by  the  human  designer. 
Fifth and finally, the database must be easily maintainable
and updateable. Few traditional database technologies
provide all of these, particularly a criterion for
measuring the similarity of geometric shapes.

In the following, we will address the general application
of neural networks to the group technology problem,
where the designs are derived from a CAD system. Later
in the paper, we will discuss the results of a specific
neural database architecture that finds similar marker
(i.e., decal) designs. Markers are found in the passenger
compartments and service bays of commercial airliners,
and indicate locations of services, warnings, and
restrictions to people who move and work in and around
the aircraft. In this specific system, the data is not
derived from a CAD system, but is acquired from paper
drawings of the markers with the help of a PC-based
optical scanner, and is transferred to the network in raster
format.

In  the  next  section,  we  describe  how  a  specific  artificial 
neural  network  can  meet  all  of  the  requirements  of  a 
group  technology  implementation.  We  will  assume  that 
there  exists  an  electronic  description  of  the  design 
information.  First  we  will  introduce  the  ART-1 
algorithm.  We  will  then  discuss  the  process  of 
information  translation  into  the  binary  representations 
needed  for  input  to  the  network.  A  modification  of  the 
simulation  is  mentioned  that  makes  use  of  data 
compression  techniques.  Finally,  the  markers  retrieval 
system  will  be  described. 

ART-1  Algorithm 

The adaptive resonance theory (ART) neural network
model was developed by Carpenter and Grossberg^. The
version of this model that processes binary input
patterns is referred to as ART-1. The ART-1 neural
network model is canonically represented by a coupled
set of ordinary nonlinear differential equations^. If
appropriate assumptions are made about the relationship
between the learning rates and the dynamical time
constants, this system of equations can be replaced by a
procedural algorithm^. This "fast learning" mode
requires that the learning process stabilize each
time before the next input pattern is presented. The
impact of this assumption on both hardware and
software implementation is large: the computational
steps of the algorithm can now be directly mapped onto
an algorithmic processor. For this model, there is no
need to become embroiled in the implementation issues
of dynamical systems.

The  basic  functionality  of  this  algorithm  is  to 
autonomously  place  input  patterns  into  clusters  or 
families.  These  patterns  are  represented  as  binary 
vectors.  Clusters  are  formed  and  modified  during  the 
training  process,  often  referred  to  as  "self-organizing” 
learning.  The  number  of  clusters  is  not  preset  at  the 
beginning,  but  is  determined  by  the  underlying  structure 
of  input  patterns  used  during  training  and  by  a  small  set 
of  network  parameters.  After  training,  the  network  is 
used  as  a  "neural  database",  being  queried  by  new  input 
patterns  to  find  the  closest  family.  Again,  the  input 
patterns  must  be  represented  as  binary  vectors. 

A  characteristic  of  this  self-organizing  neural  network  is 
the  formation  of  memory  templates  or  archetypes  during 
the  repeated  exposure  of  the  network  to  the  training  set. 
A  template  isolates  a  conjunctive  generalization^  of  the 
attributes  representing  the  member  patterns  in  that 
cluster.  If  the  input  pattern,  denoted  I,  is  found  to  be  a 
member  of  an  existing  cluster  after  a  search  of  neural 
memories, then this pattern is added to the membership
list  for  that  cluster,  and  the  template  associated  with  this 
cluster  is  updated  to  include  the  features  of  the  new 
pattern.  The  updated  template  is  a  conjunction,  or  an 
"and",  of  the  matching  template  and  the  newly  added 
member  input  vector. 

On  the  other  hand,  if  I  is  new  to  the  system's 
memories,  then  a  new  cluster  is  formed  with  I  being  the 
first  member.  In  this  case,  the  new  template 
representing the new cluster becomes I. (That is, the
archetype for a group with one member is the member
itself.) This process proceeds automatically with no
outside  supervision,  finding  order  and  structure  in  the 
stream  of  input  patterns.  For  the  learning  process  to 
stabilize,  the  training  set  of  input  patterns  is  repetitively 
presented  to  the  network.  In  summary:  when  a  new 
input  vector  is  presented,  it  is  then  either  placed  into 
one  of  the  existing  clusters,  or  classified  as  a  novel 
pattern  and  added  to  a  new  cluster. 

During the search of the memory templates, the dot
product of each memory template with the input vector
is computed, as are the vector norms of each template
and the input vector. That is,

    I • T_k,   |T_k|,   |I|,     k = 1, ..., n_c,

where I is the current input vector, T_k is the k-th
memory template, n_c is the current number of
groupings, • is the dot product, and |.| is the L1 norm.
During learning, the conjunction of the template and the
input vector must be calculated as well. That is,

    T_k ← (I ∩ T_k),

where ∩ is bit-wise "and". These calculations constitute
a major portion of the processing load of the ART-1
algorithm.
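
To show where these three calculations appear, here is a minimal Python sketch of a fast-learning ART-1 style clustering loop. It is not the simulator used in this work. The search order (a choice value I • T_k / (beta + |T_k|)) and the vigilance test (I • T_k >= rho |I|) follow a commonly used formulation of ART-1, and the parameter names `vigilance` and `beta`, like the function names, are assumptions. Input patterns are assumed to be non-zero lists of 0's and 1's.

```python
def dot(a, b):
    """Dot product of two binary vectors (number of shared 1 bits)."""
    return sum(x & y for x, y in zip(a, b))


def art1_train(patterns, vigilance=0.6, beta=1.0, max_epochs=10):
    """Cluster binary vectors; return (templates, cluster index per pattern)."""
    templates = []
    assignments = [None] * len(patterns)
    for _ in range(max_epochs):
        changed = False
        for i, pattern in enumerate(patterns):
            # Search the templates in order of decreasing choice value.
            ranked = sorted(
                range(len(templates)),
                key=lambda k: dot(pattern, templates[k]) / (beta + sum(templates[k])),
                reverse=True,
            )
            winner = None
            for k in ranked:
                # Vigilance test: enough of the input's 1 bits lie in template k.
                if dot(pattern, templates[k]) >= vigilance * sum(pattern):
                    winner = k
                    break
            if winner is None:
                # Novel pattern: it becomes the template of a new cluster.
                templates.append(list(pattern))
                winner = len(templates) - 1
            else:
                # Learning: the template becomes the bit-wise "and" with the input.
                templates[winner] = [x & y for x, y in zip(pattern, templates[winner])]
            changed = changed or assignments[i] != winner
            assignments[i] = winner
        if not changed:   # memberships are stable, so stop presenting the set
            break
    return templates, assignments


patterns = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
            [0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 1]]
templates, assignments = art1_train(patterns)
print(assignments)
print(templates)
```

The four toy patterns fall into two clusters, and each template ends up as the conjunction of its members.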

The  Neural  Network  Approach 

Healy  and  Caudell  have  further  developed  the 
understanding  of  the  logical  functionality  of  the  ART-1 
network  and  have  developed  a  methodology  for  the 
design  of  macrocircuits  of  ART-1  network  modules'*. 
Through  the  study  of  these  logical  architectures,  we  have 
applied  ART-1  to  the  group  technology  problem.  In 
this  application,  the  network  is  trained  on  design 
representations derived directly from descriptions
generated by computer aided design (CAD)
packages such as CATIA and CAD-KEY. Two, three,
and  higher  dimensional  descriptions  are  being  used  to 
represent  features  of  designs. 

The CAD system usually stores a "constructive
description" of the part, that is, a list of instructions
that tell a graphics rendering program how to draw a
diagram of the part. The diagram tells the design
engineer how this part fits into the overall system, the
manufacturing engineer how to design the
manufacturing process for the part, and the field service
engineer how to maintain the part in the system. From
this constructive description, a transformed
representation must be produced by a preprocessing
system to become the input for the neural network. The
description of the design may come in other forms,
including raster scanned images as mentioned above. In
this latter case, no preprocessing is required.

For a 2D design, such as a sheet metal floor stiffener in
an aircraft, the simplest transformed representation is a
binary pixel map or silhouette: ones where there is solid
material and zeros where there is none, defined over a
predefined 2D graphical view port. This is shown in
Figure 1a. The view port is a window on 2D space.
The binary pixel map is strung out, or rasterized, into a
binary vector by concatenating rows of pixels from the
view port. This vector is subsequently fed to the ART-1
neural network simulator for clustering into families.
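
As a small illustration of the rasterization step, the Python below concatenates the rows of a toy view port into a single binary vector. The function name and the 3 x 5 silhouette are hypothetical.

```python
def rasterize(pixel_map):
    """Concatenate the rows of a 2D binary pixel map into one binary vector."""
    return [bit for row in pixel_map for bit in row]


# A toy 3 x 5 view port: ones where there is material, zeros elsewhere.
silhouette = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
]
vector = rasterize(silhouette)
print(vector)
```

The length of the vector is the number of rows times the number of columns of the view port, so it grows quickly with resolution.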

Other forms of information may be represented as binary
patterns. For example, Figure 1b illustrates how the
position of fastener holes can be represented in a view
port with the same dimensions as the silhouette, but
with ones in the neighborhood of a hole, and zeros
otherwise. The locations and degree of metal bends can
be represented in a three dimensional "Hough Space",
where the first two axes code the slope and intercept of
the bend line, while the third axis codes the bend angle.
In this case, each bend line would be represented as a
single point in a 3D space. If the angle of the bend is
not important, then a bend line could be represented
directly in a viewport as with the silhouette. This is
shown in Figure 1c.


Figure 1. Three representations of features of a
design: (a) is the silhouette of the part, (b) is the
location of fastener holes, and (c) is the location of
bend lines. Each of these is converted into a linear
binary vector for input to the neural network.


The limitation of this type of sparse representation is in
the explosion of the length of the binary input vector.
The resolution of the pixelization determines the overall
length of the binary input vector. The resolution also
determines the accuracy of the object representation, and
if too coarse it will strongly affect the way the network
groups the designs. Even though the bits in the binary
vector can be "packed" into 32 bit integers for storage
and manipulation, when many clusters are formed, the
total size of the vectors will tax the limits of small
engineering workstations.
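
The packing mentioned here can be sketched in Python as follows. The function names are hypothetical, and the 512 x 512 view port used in the size estimate is an assumed resolution, not one reported in this paper.

```python
def pack_bits(bits, word_size=32):
    """Pack a list of 0/1 values into integers holding word_size bits each."""
    words = []
    for start in range(0, len(bits), word_size):
        word = 0
        for bit in bits[start:start + word_size]:
            word = (word << 1) | bit
        words.append(word)
    return words


def packed_dot(a_words, b_words):
    """Dot product of two packed binary vectors: count the shared 1 bits."""
    return sum(bin(x & y).count("1") for x, y in zip(a_words, b_words))


a = [1, 0, 1, 1] * 10   # 40 bits, so two 32-bit words
b = [1, 1, 0, 1] * 10
print(pack_bits(a))
print(packed_dot(pack_bits(a), pack_bits(b)))

# Size estimate for an assumed 512 x 512 view port.
bits_per_vector = 512 * 512
print(bits_per_vector // 32, "words,", bits_per_vector // 8, "bytes per vector")
```

Even in packed form, each vector at that assumed resolution occupies 32,768 bytes, and every memory template needs the same amount again.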

In  our  normal  simulation  of  the  ART-1  algorithm,  the 
vectors  and  templates  are  in  binary  form  before  the  dot 
products,  norms,  and  conjunctions  are  calculated.  A 
practical  group  technology  parts  retrieval  system  might 
be  expected  to  require  many  ART-1  modules  running 
with  many  hundreds  of  memory  templates  each.  A 
compounding fact is that the range of engineering
workstations on which the system might possibly be
deployed includes relatively low-end PCs. The following
section  briefly  introduces  a  modification  to  the  ART-1 
algorithm  that  allows  direct  operation  on  data 
compressed  input  vectors  and  memory  templates. 

A  Compressed  ART-1  Algorithm 

There  are  significant  advantages  to  applying  data 
compression  techniques  to  the  binary  representations 
used  in  this  ART-1  system.  First  of  all,  there  is  no 
random  "noise"  in  designs,  making  accurate  compression 
possible. Second, a bit map of a design will quite
frequently have long strings of 1's and 0's as the material
of the part is transited, producing potentially large data
compression  ratios.  Finally,  the  neural  network 
simulation  will  have  fewer  actual  numbers  to  process 
per  part,  reducing  the  execution  times. 

In this work, standard run-length encoding is used. For
an example, see Figure 2. Although other more
sophisticated techniques are available, it is the low
conversion overhead and basic simplicity of run-length
encoding that makes it ideal for this application. A
run-length algorithm returns a list of integers that represents
the lengths of runs of consecutive 1's and/or 0's in the
binary vector. Efficient linear algorithms exist to
compress data into this format. With the assumption
that the starting value of the list is known, the fact that
the 1's and 0's alternate allows this list to be stored
without the actual values of the runs.


000000111111111111111110000000011111111111100000

Binary vector


Figure  2.  An  example  of  a  short  binary  vector.  The 
run-length  code  C  for  this  string  is  {6,17,8,12,5}  with 
byte  compression  ratio  of  8/5.  This  ratio  assumes  that 
the  uncompressed  vector  is  stored  in  compact  form  in 
8-bit  bytes,  and  that  the  maximum  length  of  a  single 
run is 256. The bar above the vector symbolically
indicates the location of strings of 0's and 1's, and is
used to explain the compressed algorithms later in the
text.
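
A linear-time encoder and decoder of this kind can be sketched in Python as below. The function names are hypothetical, and the sketch assumes the convention that the first run always counts 0's, so a vector that starts with a 1 gets a leading run of length zero.

```python
def run_length_encode(bits):
    """Return run lengths; by convention here the first run counts 0's (maybe zero of them)."""
    runs, current, length = [], 0, 0
    for bit in bits:
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return runs


def run_length_decode(runs):
    """Rebuild the binary vector from run lengths that start with a run of 0's."""
    bits, value = [], 0
    for length in runs:
        bits.extend([value] * length)
        value ^= 1
    return bits


# The vector described by the run-length code in Figure 2.
vector = [0] * 6 + [1] * 17 + [0] * 8 + [1] * 12 + [0] * 5
code = run_length_encode(vector)
print(code)
```

Because the values alternate, only the lengths need to be stored, and decoding the code reproduces the original vector exactly.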

The  ART-1  simulation  used  for  this  research  was 
modified  to  include  compressed  versions  of  the  vector 
operations  described  above.  The  input  patterns  are 
compressed  before  presentation  to  the  network.  The 
memory  templates  are  created  and  updated  directly  in 
compressed  form.  Data  compression  ratios  and 
execution  times  were  measured  for  both  compressed  and 
uncompressed  versions  of  the  simulations.  In  these 
experiments,  compression  ratios  of  up  to  20  were  found 
using 2-D CAD designs. In addition, speedups of the
ART-1 algorithm of upwards of 100 were measured for
3-D CAD designs. These improvements are important
to the developers and end-users of these neural retrieval
systems because they make deployment of practical
applications on existing engineering workstations
possible.
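
One way such compressed vector operations could work is sketched below in Python. This is not the implementation used in the simulator. The function names are hypothetical, both codes are assumed to describe vectors of the same length, and the first run of each code is assumed to count 0's. The idea is to walk the two run lists in step, so that the work grows with the number of runs rather than the number of bits.

```python
def rle_segments(a_runs, b_runs):
    """Yield (length, a_bit, b_bit) for stretches where both vectors are constant."""
    ia = ib = 0
    left_a, left_b = a_runs[0], b_runs[0]
    while True:
        # Move past exhausted (possibly zero-length) runs; the run index parity is the bit.
        while left_a == 0 and ia + 1 < len(a_runs):
            ia += 1
            left_a = a_runs[ia]
        while left_b == 0 and ib + 1 < len(b_runs):
            ib += 1
            left_b = b_runs[ib]
        if left_a == 0 or left_b == 0:
            return
        step = min(left_a, left_b)
        yield step, ia % 2, ib % 2
        left_a -= step
        left_b -= step


def rle_norm(runs):
    """L1 norm of a run-length coded binary vector: total length of the runs of 1's."""
    return sum(runs[1::2])


def rle_dot(a_runs, b_runs):
    """Dot product computed directly on the run-length codes."""
    return sum(n for n, a, b in rle_segments(a_runs, b_runs) if a and b)


def rle_and(a_runs, b_runs):
    """Run-length code of the bit-wise "and" of two run-length coded vectors."""
    runs, value = [0], 0    # runs[-1] counts bits equal to `value`
    for n, a, b in rle_segments(a_runs, b_runs):
        bit = a & b
        if bit == value:
            runs[-1] += n
        else:
            runs.append(n)
            value = bit
    return runs


figure2 = [6, 17, 8, 12, 5]   # the code from Figure 2 (48 bits)
other = [10, 20, 18]          # a second, made-up 48-bit vector
print(rle_dot(figure2, other), rle_norm(figure2), rle_and(figure2, other))
```

In this example the two vectors share 1's only in bit positions 10 through 22, so the dot product is 13 and the conjunction has the code {10, 13, 25}.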

Neural  Database  Architecture 

For  the  group  technology  applications  considered  so  far 
in  our  research  group,  a  generic  system  architecture  is 
emerging.  This  can  be  seen  in  Figure  3.  The  basic 
components  are  1)  CAD  System  Interface,  2)  Parser,  3) 
Representation  Generator,  4)  Neural  Network 
macrocircuits,  and  5)  User  Interface. 

In a group technology system the lists of parts which
form each cluster are maintained during training. When
the user queries the system with a new design, that
design is presented to the network and the list of parts
which previously grouped in the same cluster is
returned.


18  T.P. Caudell, S.D.G. Smith, and S. Tazuma


The  functionality  of  the  Parser  is  to  extract  the  salient 
information from the CAD System Interface. Typically,
this  interface  is  an  ASCII  data  file  containing  the 
constructive  description  of  the  part.  It  may  also  be  a 
raster  file  of  an  image.  The  extracted  information  might 
be  a  list  of  lines  and  arcs  defining  the  border  of  the  part, 
the  location  of  fastener  holes,  or  a  bit  map  of  the  design. 
Unfortunately,  the  structure  of  the  data  files  usually 
depends  on  the  style  and  consistency  of  the  user  of  the 
CAD  program,  making  multiple  searches  of  the  data 
necessary.  Sometimes  information  on  a  substructure  of 
the  part  will  be  distributed  in  many  locations  in  the 
CAD  file.  The  Parser  is  the  only  component  specific  to 
the  brand  of  CAD  program  being  used,  and  must  be 
redesigned  for  each  new  system. 




Figure  3.  A  schematic  of  the  components  of  a  neural 
database  system  for  group  technology.  The  User 
Interface  provides  control  of  the  level  of  abstraction  of 
recall  in  the  network. 

The  Representation  Generator  converts  and  compresses 
the  information  extracted  by  the  Parser  into  a  form 
usable  by  the  neural  network.  This  includes  operations 
such  as  the  generation  of  the  2D  viewports,  generating 
silhouettes  by  filling  in  boundaries,  computing  the 
location  of  points  in  Hough  spaces,  and  the  compression 
of  each  representation  into  run-length  codes.  This 
component  is  independent  of  the  type  of  CAD  program 
used  in  design  generation,  but  will  vary  depending  on 
the  types  of  representations  required  to  capture  the 
significant  features  that  best  discriminates  the  design 
families. 
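
As an illustration of the compression step, the sketch below run-length encodes one row of a binary bit map. The (value, run length) pair format and the function names are assumptions made for this example; the text does not specify the encoding actually used by the Representation Generator.

```python
from itertools import groupby


def run_length_encode(bits):
    """Encode a sequence of 0/1 values as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(bits)]


def run_length_decode(runs):
    """Invert run_length_encode."""
    bits = []
    for value, length in runs:
        bits.extend([value] * length)
    return bits


# One row of a hypothetical silhouette bit map.
row = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
encoded = run_length_encode(row)
print(encoded)  # [(0, 3), (1, 5), (0, 2)]
```

Long runs of identical pixels, which are typical of filled silhouettes, collapse to a single pair, which is why this kind of code tends to suit such representations.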

The  structure  of  the  ART-1  Macrocircuits  component  is 
also  dependent  on  the  representations,  and  will  vary 
according  to  requirements  of  the  database  users.  A 
macrocircuit  is  a  collection  of  neural  network  modules, 
connected  together  in  a  larger  and  more  functional 
network.  These  are  necessary  if  a  network  is  to  give  the 
user  a  range  of  query  options.  For  example,  the  user 
may  choose  to  query  the  database  for  designs  that  have 
the  same  general  size,  represented  by  a  bounding 
rectangle  or  box.  After  limiting  the  choices  of  families 
by  this  step,  the  user  may  next  want  to  discriminate 
according to the specific shape of the object. The
structure of the macrocircuit strongly affects the range of
functionality provided by the neural database. (See
Figure 6 for a diagram of the macrocircuit used in the
demonstration system discussed in the following
section.)

Figure 4. An example of an ART Tree database
structure. Each cube represents a macrocircuit of
ART-1 neural networks to provide a "don't care" option
to query.


Another user requirement might be the ability to vary
the degree of the family discriminators, allowing on-line
specification of the closeness of a match in the search
for similarity. This can be implemented with a
hierarchical abstraction tree of macrocircuit modules, as
shown in Figure 4. Each module in the tree is trained
separately with the input patterns associated only with
that branch cluster, and each module receives the
complete set of representations. The modules at the top
of the tree have the greatest discrimination, while the
one at the bottom has the least. When a query occurs,
the lowest module places the design into one of its
families or clusters. Families at this level represent the
most general abstraction of the possible set of designs
stored in the system. When a winning cluster is selected
at the first level, the module up the branch of the tree
associated with this group is activated. This module
then places the design into one of its clusters, and the
process repeats. The user selects the level of abstraction
at retrieval time according to the current requirements.
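
The retrieval procedure just described can be sketched as a walk through a tree of modules. In this minimal sketch the class `Module`, the function `retrieve`, and the overlap-based `classify` rule are all hypothetical stand-ins: a real module would be a trained ART-1 macrocircuit, not a nearest-template lookup.

```python
class Module:
    """Stand-in for one macrocircuit: maps a binary pattern to a cluster id."""

    def __init__(self, templates, children=None):
        self.templates = templates        # cluster id -> binary template
        self.children = children or {}    # cluster id -> finer-grained Module

    def classify(self, pattern):
        # Pick the template sharing the most 1-bits with the pattern
        # (an assumption; ART-1 uses a choice function and a vigilance test).
        def overlap(cluster_id):
            template = self.templates[cluster_id]
            return sum(p & t for p, t in zip(pattern, template))
        return max(sorted(self.templates), key=overlap)


def retrieve(root, pattern, levels):
    """Return the path of cluster ids, visiting at most `levels` modules."""
    path, module = [], root
    while module is not None and len(path) < levels:
        cluster = module.classify(pattern)
        path.append(cluster)
        module = module.children.get(cluster)  # next module up this branch
    return path


# The root is the least discriminating (most general) module.
fine = Module({"a1": [1, 1, 0, 0], "a2": [1, 0, 0, 0]})
coarse = Module({"A": [1, 1, 0, 0], "B": [0, 0, 1, 1]}, children={"A": fine})
print(retrieve(coarse, [1, 1, 0, 0], levels=1))  # ['A']
print(retrieve(coarse, [1, 1, 0, 0], levels=2))  # ['A', 'a1']
```

The `levels` argument plays the role of the user-selected level of abstraction: a small value returns a broad family, a larger value a narrower one.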

An Application to Marker Retrieval

As an illustration of the types of systems currently
under development, more detail will be given on the
marker design retrieval system. Figure 5 gives two
markers that are similar in size and textual content, but
differ in the graphical information. It is possible that
only one of these need be saved. Often new markers are
needlessly designed because no retrieval system exists to
aid the designer. The marker designs are produced and
stored on sheets of paper bound in volumes,
complicating electronic access.

For this demonstration, approximately 50 markers were
digitized on a Macintosh optical scanner to capture the
graphical shape. These images were then converted to
raster file bit maps for input to the neural network. In
addition, the cut-out die size and textual content of the
marker were recorded with the image. Figure 6 gives the
details of how sets of ART-1 modules are connected to
implement the database system.

The detailed structure of this macrocircuit evolves during
the learning process, where a training set of marker
designs is repeatedly presented to the network. The
die size and textual information are used to form
families. When a new size/text family forms, an ART-1
module is created to cluster the graphics associated with
this family into subfamilies of similar shapes. In Figure
6, the shape representation is considered last by the
highest ART-1 module.
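
The following is a minimal sketch of this two-stage grouping, assuming the textbook fast-learning form of ART-1 on uncompressed binary vectors and a single pass over the data. The function names, the parameter defaults, and the example markers are hypothetical; the demonstration system itself is written in C and its details are not given here.

```python
def art1_assign(pattern, templates, vigilance=0.7, beta=1.0):
    """Assign a binary pattern to a cluster, updating `templates` in place."""
    norm = sum(pattern)

    def overlap(j):
        return sum(p & t for p, t in zip(pattern, templates[j]))

    # Search clusters in order of the choice function |I AND w| / (beta + |w|).
    order = sorted(range(len(templates)),
                   key=lambda j: overlap(j) / (beta + sum(templates[j])),
                   reverse=True)
    for j in order:
        if norm and overlap(j) / norm >= vigilance:   # vigilance test
            # Fast learning: the template becomes the intersection.
            templates[j] = [p & t for p, t in zip(pattern, templates[j])]
            return j
    templates.append(list(pattern))                   # no match: new cluster
    return len(templates) - 1


def group_markers(markers, vigilance=0.7):
    """markers: iterable of (die_size, text, bitmap) triples.

    Returns one ((die_size, text), subfamily_index) label per marker.
    """
    families = {}   # (die_size, text) -> shape templates of that family
    labels = []
    for die_size, text, bitmap in markers:
        key = (die_size, text)
        # A new size/text family gets its own (initially empty) shape module.
        templates = families.setdefault(key, [])
        labels.append((key, art1_assign(bitmap, templates, vigilance)))
    return labels


markers = [
    ("2x4", "EXIT", [1, 1, 1, 0, 0, 0]),
    ("2x4", "EXIT", [1, 1, 0, 0, 0, 0]),
    ("2x4", "EXIT", [0, 0, 0, 1, 1, 1]),
    ("3x3", "NO SMOKING", [1, 1, 1, 0, 0, 0]),
]
labels = group_markers(markers, vigilance=0.6)
```

With these made-up inputs the first two markers share a shape subfamily, the third starts a new subfamily within the same size/text family, and the fourth starts a new family altogether.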

One  advantage  of  this  sort  of  hierarchical  structure  is 
that  it  could  be  easily  incorporated  into  a  traditional 
database  system.  The  categorization  that  occurs  before 
presenting  the  graphical  images  to  the  neural  network 
could  be  performed  by  querying  an  existing  database. 
Thus,  any  attributes  of  the  markers  that  have  been 


entered  into  a  database  could  be  used  prior  to  graphical 
grouping. 

The demonstration system mentioned above is
implemented in the C language on Sun SPARC
workstations. Training for this small system takes less
than ten minutes, and retrieval time for a new design is
less than a second. The ART Tree structure has not
been implemented for this application. Figure 7 shows
a screen dump of a trained network.

A  neural  network  grouping  system  for  airplane  markers 
could  be  used  in  a  number  of  ways.  The  existing 
markers  could  be  grouped  and  then  the  groups  examined 
by  a  human  to  locate  and  purge  duplicate  markers.  This 
would  save  money  in  maintenance.  Also,  such  a  system 
could  be  used  for  group  technology  to  return  the  closest 
existing  markers  to  a  new  one  being  designed.  This 
would  help  avoid  the  future  proliferation  of  duplicate 
markers.  Finally,  additions  to  traditional  databases  could 
be  constructed  which  would  graphically  group  the 
markers  returned  to  the  user  in  response  to  a  query. 

Figure 5. Two markers that are the same size and have the same message, but contain different
graphical information.

Figure 6. The macrocircuit of ART-1 modules that implements the marker design retrieval system.

Figure 7. A screen dump from a Sun SPARC1 simulation of the sheet metal floor stiffener retrieval
system. The three representations of shape, holes, and bends appear in viewports across the top of the
image. The set of silhouettes in the upper middle of the figure are the memory templates for the shape
ART-1 module. The lower set of rectangular windows show the results of the holes and bends modules
for each shape cluster.

Conclusions

Artificial  neural  networks  have  been  applied  to  design 
retrieval.  ART-1  networks  are  used  to  adaptively  group 
together similar engineering or graphical designs. The
information used to group is coded into binary
representations which, in their basic form, amount to
bit maps of design descriptors. We have used this
technology to build neural databases for the retrieval of
two- and three-dimensional engineering designs. We
have discussed in detail a feasibility-level system that
learns  to  group  airliner  markers  into  families,  and  then 
to  recall  the  family  when  presented  a  similar  marker. 
The  input  to  these  networks  may  be  generated  directly 
from  CAD  designs  of  the  parts  or  other  sources  of  object 
features. 

An  addition  to  the  algorithmic  form  of  ART-1  was 
introduced  that  allows  it  to  operate  directly  on  run-length 
encoded  vectors,  and  to  generate  compressed  memory 
templates.  When  compared  to  the  regular  uncompressed 
algorithm  on  real  engineering  designs,  the  performance 
of  this  compressed  algorithm  demonstrated  a  significant 
savings  in  storage  of  the  input  vector  and  the  memory 
templates.  A  surprising  result  was  the  size  of  the  speed 
up  in  execution  of  the  simulation  on  larger  input 
vectors.  Issues  of  object  scale,  orientation,  and 
reflection  have  not  been  discussed  here,  although  they 
have  been  dealt  with  in  the  working  systems.  The  code 
for  a  system  that  groups  aircraft  floor  stiffener  sheet 
metal parts has been transferred to a PC-based
engineering workstation for beta testing. The application
of neural networks to group technology is of great
practical value to industry, making it possible to
avoid duplication of design efforts and save many
downstream costs.

Abstract

This paper discusses multiple Bayesian network
representation paradigms for encoding asymmetric
independence assertions. We offer three contributions: (1) an
inference mechanism that makes explicit use of asymmetric
independence to speed up computations, (2) a simplified
definition of similarity networks and extensions of their
theory, and (3) a generalized representation scheme that
encodes more types of asymmetric independence
assertions than do similarity networks.

Introduction 

Traditional probabilistic approaches to diagnosis,
classification, and pattern recognition face a critical choice:
either specify precise relationships between all interacting
variables or make uniform independence assumptions
throughout. The first choice is computationally infeasible
except in very small domains, while the second, which
is rarely justified, often yields inadequate conclusions.

Bayesian networks offer a compromise between the two
extremes by encoding independence when possible and
dependence when necessary. They allow a wide spectrum
of independence assertions to be considered by the model
builder so that a practical balance can be established
between computational needs and adequacy of conclusions.

Although Bayesian networks considerably extend
traditional approaches, they are still not expressive enough
to encode every piece of information that might
reduce computations. The most obvious omissions are
asymmetric independence assertions stating that
variables are independent for some but not necessarily for
all of their values. Such asymmetric assertions cannot
be represented naturally in a Bayesian network.
Several researchers observed this limitation; however, until
recently no effort was made to remove it.

The similarity network paradigm is the first major effort
toward the representation of asymmetric independence
[Heckerman, 1990]. Contingent influence diagrams are an
alternative approach [Fung and Shachter, 1991]. Both
schemes employ asymmetric independence to ease the
elicitation and improve the quality of probabilistic
models.

*This paper is reprinted from the proceedings of the 7th
Uncertainty in Artificial Intelligence conference, Los Angeles,
California.

This article offers three contributions: (1) an inference
mechanism that makes explicit use of asymmetric
independence to speed up computations, (2) a simplified
definition of similarity networks and extensions of their
theory, and (3) a generalized representation scheme that
encodes more types of asymmetric independence
assertions than do similarity networks.

These contributions address problems of knowledge
representation, inference, and knowledge acquisition. In
particular, Section 2 describes Bayesian multinets and
how to use them for inference. Section 3 describes
knowledge acquisition using similarity networks and how to
convert them to Bayesian multinets. Section 4 extends
these representation schemes to the case where hypotheses
are not mutually exclusive, and Section 5 summarizes
the results. We assume the reader is familiar with the
definition and usage of Bayesian networks. For details
consult [Pearl, 1988].

Representation  and  Inference 
Bayesian  Multinets 

The following example demonstrates the problem of
representing asymmetric independence by Bayesian
networks:

A guard of a secured building expects three types of
persons to approach the building’s entrance: workers
in the building, approved visitors, and spies. As
a person approaches the building, the guard notes
the person’s gender and whether or not the person wears a
badge. Spies are mostly men. Spies always wear
badges in order to fool the guard. Visitors don’t
wear badges because they don’t have one. Female
workers tend to wear badges more often than do
male workers. The task of the guard is to identify
the type of person approaching the building.

A Bayesian network that represents this story is shown
in Figure 1. Variable h in the figure represents the correct
identification. It has three values w, v, and s,
respectively denoting worker, visitor, and spy. Variables g
and b are binary variables representing, respectively, the
person’s gender and whether or not the person wears a
badge. The links from h to g and from h to b reflect the
fact that both gender and badge-wearing are clues for
correct identification, and the link from g to b encodes
the relationship between gender and badge-wearing.

Unfortunately, the topology of this network hides the
fact that, independent of gender, spies always wear
badges and visitors never do. The network does not
show that gender and badge-wearing are conditionally
independent given that the person is a spy or a visitor. A
link between g and b is drawn merely because gender
and badge-wearing are related variables when the
person is a worker.


Figure 1: A Bayesian network for the secured-building
example.

We can more adequately represent this story using the two
Bayesian networks shown in Figure 2. The first network
represents the cases where the person approaching
the entrance is either a spy or a visitor. In these cases,
badge-wearing depends merely on the type of person
approaching, not on the person’s gender. Consequently, nodes b and
g are shown to be conditionally independent (node h
blocks the path between them). The links from h to
b and from h to g in this network reflect the fact that
badges and gender are relevant clues for distinguishing
between spies and visitors. The second network
represents the hypothesis that the person is a worker, in which
case gender and badge-wearing are related as shown.


Figure  2:  A  Bayesian  multinet  representation  of  the 
secured-building  story. 

Figure 2 is a better representation than Figure 1
because it shows the dependence of badge-wearing on
gender only in the context in which such a relationship exists,
namely, for workers. Moreover, the former representation
requires 11 parameters while the representation
of Figure 2 requires only 9. This gain, due to asymmetric
independence, could be substantially larger for
real-sized problems because the number of parameters
needed grows exponentially in the number of variables,
whereas the overhead of representing multiple networks
grows only linearly.
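
One way to arrive at the counts 11 and 9 is to count free parameters per conditional probability table. The sketch below assumes that the multinet count consists of the prior of h plus the evidence tables of each local network; this accounting is our reading of the text, not something it spells out, and the function names are hypothetical.

```python
from math import prod


def table_size(node, cardinality, parents):
    """Free parameters in the conditional probability table of `node`."""
    return (cardinality[node] - 1) * prod(cardinality[p] for p in parents[node])


def network_size(cardinality, parents, skip=()):
    """Total free parameters over all nodes not listed in `skip`."""
    return sum(table_size(v, cardinality, parents)
               for v in parents if v not in skip)


# Figure 1: h -> g, h -> b, g -> b, with h in {worker, visitor, spy}.
fig1 = network_size({"h": 3, "g": 2, "b": 2},
                    {"h": [], "g": ["h"], "b": ["g", "h"]})

# Figure 2: the prior of h, plus the evidence tables of each local network.
prior = 3 - 1
spy_or_visitor = network_size({"h": 2, "g": 2, "b": 2},
                              {"h": [], "g": ["h"], "b": ["h"]}, skip=("h",))
worker = network_size({"h": 1, "g": 2, "b": 2},
                      {"h": [], "g": ["h"], "b": ["g", "h"]}, skip=("h",))
fig2 = prior + spy_or_visitor + worker

print(fig1, fig2)  # 11 9
```

The saving comes entirely from the table for b: in Figure 1 it is conditioned on both g and h for every hypothesis, whereas in Figure 2 the dependence on g is kept only for workers.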

We call the representation scheme of Figure 2 a
Bayesian multinet.

Definition Let $\{u_1, \ldots, u_n\}$ be a finite set of variables
each having a finite set of values, $P$ be a probability
distribution having the Cartesian product of these sets of
values as its sample space, and $h$ be a distinguished
variable among the $u_i$'s that represents a mutually exclusive
and exhaustive set of hypotheses. Let $A_1, \ldots, A_k$ be a
partition of the values of $h$. A directed acyclic graph
$D_i$ is called a local network of $P$ (associated with $A_i$)
if it is a Bayesian network of $P$ given that one of the
hypotheses in $A_i$ holds, i.e., $D_i$ is a Bayesian network of
$P(u_1 \ldots u_n \mid A_i)$. The set of $k$ local networks is called a
Bayesian multinet of $P$.^1

In the secured-building example of Figure 2,
{{spy, visitor}, {worker}} is a partition of the values of
the hypothesis node h, one local network is a Bayesian
network of $P(h, b, g \mid \mathit{worker})$ and the other local network
is a Bayesian network of $P(h, b, g \mid \{\mathit{spy}, \mathit{visitor}\})$.^2

The fundamental idea of multinets is that of conditioning:
each local network represents a distinct situation,
conditioned on the hypotheses being restricted to a
specified subset. Savings in computations and space occur
because, as a result of conditioning, asymmetric
independence assertions are encoded in the topology of the
local networks. In the example above, conditional
independence between gender and badge-wearing is encoded
as a result of conditioning on h.

Notably, conditioning may also destroy independence
relationships rather than create them [Pearl, 1988].
However, if the distinguished variable is a root node (i.e.,
a node with no incoming links), conditioning on its
values never decreases and often increases the number of
independence relationships, resulting in a more expressive
graphical representation. Other situations are addressed
below where the hypothesis variable is not a root node
or where more than one node represents hypotheses.

Representational  and  Computational 
Advantages 

The vanishing dependence between gender and
badge-wearing is an example of an hypothesis-specific
independence because it is manifest only when conditioning on
specific hypotheses, that is, for spies and visitors, but
not for workers. The following variation of the
secured-building example demonstrates an additional type of
asymmetric independence that can be represented by
Bayesian multinets as well.

^1 A Bayesian multinet roughly corresponds to an
hypothesis-specific similarity network as defined in
Heckerman’s dissertation (1990, page 76).

^2 The conditioning set {spy, visitor} is a shorthand
notation for saying that h draws its values from this set, namely,
either h = spy or h = visitor.

The guard of the secured building now expects four
types of persons to approach the building’s
entrance: executives, regular workers, approved
visitors, and spies. The guard notes gender,
badge-wearing, and whether or not the person arrives in
a limousine (l). We assume that only executives
arrive in limousines and that male and female
executives wear badges just as do regular workers (to
serve as role models).

This story is represented by the two local networks
shown in Figure 3. One network represents a situation
where either a spy or a visitor approaches the building,
and the other network represents a situation where either
a worker or an executive approaches the building. The
link from h to l in the latter network reflects the fact that
arriving in limousines is a relevant clue for distinguishing
between workers and executives. The absence of this
link in the former network reflects the fact that it is not
relevant for distinguishing between spies and visitors.

The vanishing dependence between gender and the
hypothesis variable h when h is restricted to the subset of
hypotheses {worker, executive} is an example of subset
independence. Similarly, badge-wearing is independent
of h when restricted to {worker, executive}, and
arriving in limousines is independent of h when restricted to
{spy, visitor}.^3

Subset independence is a source of considerable
computational savings. For example, in lymph-node
pathology fewer than 20% of the potential morphological findings
are relevant for distinguishing any given pair of disease
hypotheses (among over 60 diseases) [Heckerman, 1990].


Figure 3: A Bayesian multinet representation of the
augmented secured-building story.

Below we demonstrate these computational savings
using the simple secured-building example; more
savings are obtained in real domains such as lymph-node
pathology.

^3 Heckerman coined the terms subset independence and
hypothesis-specific independence in his dissertation.


Suppose the guard sees a male (g) wearing a badge (b)
approach the building, and suppose the guard doesn’t
notice whether or not the person arrives in a limousine.
A computation of the posterior probability of each
possible identification (executive, worker, visitor, spy) based
on the Bayesian network of Figure 1 simply yields the
chaining rule:

$$P(h \mid g, b) = K \cdot P(h) \cdot P(g \mid h) \cdot P(b \mid g, h) \qquad (1)$$

where  K  is  the  normalizing  constant. 

Using  the  representation  of  Figure  3,  however,  the  fol¬ 
lowing  more  efficient  computations  are  done  instead: 


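$$P(\mathit{spy} \mid g, b) = K \cdot P(\mathit{spy}) \cdot P(g \mid \mathit{spy}) \cdot P(b \mid \mathit{spy}) \qquad (2)$$

$$P(\mathit{visitor} \mid g, b) = K \cdot P(\mathit{visitor}) \cdot P(g \mid \mathit{visitor}) \cdot P(b \mid \mathit{visitor}) \qquad (3)$$

$$P(\mathit{worker} \mid g, b) = K \cdot P(\mathit{worker}) \cdot P(g \mid \mathit{worker}) \cdot P(b \mid g, \mathit{worker}) \qquad (4)$$

$$P(g, b \mid \mathit{executive}) = P(g, b \mid \mathit{worker}) \qquad (5)$$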


Equations 2 and 3 take advantage of an hypothesis-specific
independence assertion, namely, that g and b
are conditionally independent given, respectively, that
h = spy and h = visitor. Equation 5 uses a subset
independence assertion, namely, that b and g are independent
of h restricted to {worker, executive}.
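
To make Eqs. (2) through (5) concrete, the sketch below evaluates them for a male wearing a badge. Every number in it is invented for illustration, apart from the two facts stated in the story (spies always wear badges, visitors never do); only the pattern of which factor appears in which equation follows the text.

```python
# Hypothetical priors over the four hypotheses.
prior = {"spy": 0.01, "visitor": 0.09, "worker": 0.80, "executive": 0.10}

# Local network for {spy, visitor}: g and b are independent given h.
p_male = {"spy": 0.9, "visitor": 0.5}       # P(g | h), made up
p_badge = {"spy": 1.0, "visitor": 0.0}      # P(b | h), from the story

# Local network for {worker, executive}: g and b do not depend on h.
p_male_staff = 0.6                          # P(g | {worker, executive}), made up
p_badge_given_male_staff = 0.7              # P(b | g, {worker, executive}), made up


def posterior_male_with_badge():
    # Eq. (5): workers and executives share one evidence likelihood,
    # so it is computed once and reused.
    staff_likelihood = p_male_staff * p_badge_given_male_staff
    likelihood = {
        "spy": p_male["spy"] * p_badge["spy"],               # Eq. (2)
        "visitor": p_male["visitor"] * p_badge["visitor"],   # Eq. (3)
        "worker": staff_likelihood,                          # Eq. (4)
        "executive": staff_likelihood,                       # Eq. (5)
    }
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    k = 1.0 / sum(unnormalized.values())    # the normalizing constant K
    return {h: k * v for h, v in unnormalized.items()}


posterior = posterior_male_with_badge()
for hypothesis, probability in posterior.items():
    print(f"{hypothesis}: {probability:.4f}")
```

Because the evidence does not discriminate between workers and executives, their posterior ratio equals their prior ratio, which is the practical content of the subset independence used in Eq. (5).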

More generally, calculating the posterior probability of
each hypothesis based on a set of observations $e_1, \ldots, e_m$
is done in two steps. First, for each hypothesis $h_i$, the
probability $P(e_1, \ldots, e_m \mid h_i)$ is computed via standard
algorithms such as Spiegelhalter and Lauritzen’s (88) or
Pearl’s (88). Second, these results are combined via
Bayes’ rule:

$$P(h_i \mid e_1 \ldots e_m) = K \cdot P(h_i) \cdot P(e_1 \ldots e_m \mid h_i) \qquad (6)$$

Notably, the computation of $P(e_1 \ldots e_m \mid h_i)$ in the first
step uses the local networks as done in Eqs. (2) through
(5) and does not use a single Bayesian network as done
in Eq. (1). Consequently, when the values of h are
properly partitioned, the extra independence relationships
encoded in each local network could considerably reduce
computations.

The parameters needed to perform the above
computations consist, as we shall see next, of the prior of each
hypothesis $h_i$ and the parameters encoded in the local
networks:

Theorem 1 Let $\{u_1, \ldots, u_n\}$ be a finite set of variables
each having a finite set of values, $P$ be a probability
distribution having the Cartesian product of these sets of
values as its sample space, $h$ be a distinguished variable
among the $u_i$'s, and $M$ be a Bayesian multinet of $P$.
Then, the posterior probability of every hypothesis given
any value combination for the variables in $\{u_1, \ldots, u_n\}$
can be computed from the prior probability of $h$’s values
and from the parameters encoded in $M$.

According to Eq. 6 above, the only parameters needed
for computing the posterior probability of each
hypothesis $h_i$, aside from the priors, are $p(v_2 \ldots v_n \mid h_i)$ where
$v_2 \ldots v_n$ are arbitrary values of $u_2 \ldots u_n$ (assuming
without loss of generality that $h = u_1$). Let $D_i$ denote a local
network in $M$, $A_i$ be the hypotheses associated with $D_i$,
and $h_i$ be an hypothesis in $A_i$. Clearly, $p(v_2 \ldots v_n \mid h_i)$
is equal to $p(v_2 \ldots v_n \mid h_i, A_i)$ because $h_i$ logically implies
the disjunction over all hypotheses in $A_i$. The latter
probability is computable from the local network $D_i$ by
any standard algorithm (e.g., [Pearl, 1988]); thus, the
former is also computable as needed. □

For example, $P(g \mid \mathit{worker}, \{\mathit{worker}, \mathit{executive}\})$ is
equal to the probability $P(g \mid \mathit{worker})$ because worker
logically implies the disjunction $\mathit{worker} \lor \mathit{executive}$. In
fact, $P(g \mid \mathit{worker}, \{\mathit{worker}, \mathit{executive}\})$ is also equal to
$P(g \mid \{\mathit{worker}, \mathit{executive}\})$ because g and worker are
independent given {worker, executive} as shown in Figure 3.
In this example, the needed probability $P(g \mid \mathit{worker})$ is
equal to the given one $P(g \mid \{\mathit{worker}, \mathit{executive}\})$;
however, in general, the needed probabilities are computed
via standard inference algorithms.

Overcoming  some  Limitations 

The multinet approach described thus far is especially
beneficial when the hypothesis variable can be modeled
as a root node because, then, no dependencies are ever
introduced by conditioning on the different hypotheses.
However, the hypothesis node cannot always be modeled
as a root node. For example, in the secured-building
story, suppose there are two independent reports
indicating possible spying, say, for military and economic
reasons respectively. Such a priori factors for correct
identification are modeled as parent nodes of h, called,
say, economics and military, having no link between them
to show their mutual independence. The resulting
network in this case is simply economics → h ← military.

However, when h assumes the value spy, an induced
link is introduced between its parents economics and
military: one explanation for seeing a spy changes the
plausibility of the other explanation, thus making the
two variables economics and military dependent
when conditioned on h = spy. Consequently, an induced
link must be drawn between the economics and military
nodes in the local network for spies vs. visitors to
account for the above dependency. This link would not
appear in the full Bayesian network because economics
and military are marginally independent (they become
dependent only when conditioning on h = spy). Such
induced links are often hard to quantify and therefore,
constructing a single local network is sometimes harder
than constructing the full network, as is the case in the
above example.
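
The induced dependency can be checked numerically. In the sketch below all probabilities are invented for illustration and the function names are hypothetical; it compares the belief in the economics report with and without knowledge of the military report, first marginally and then conditioned on h = spy.

```python
# Hypothetical marginals of the two independent a priori factors.
p_e = {1: 0.1, 0: 0.9}      # P(economics)
p_m = {1: 0.2, 0: 0.8}      # P(military)

# Hypothetical P(h = spy | economics, military).
p_spy = {(0, 0): 0.01, (1, 0): 0.30, (0, 1): 0.40, (1, 1): 0.60}


def weight(e, m, spy):
    """Joint probability of (economics=e, military=m), optionally with h = spy."""
    w = p_e[e] * p_m[m]
    return w * p_spy[(e, m)] if spy else w


def p_economics(spy=False, military=None):
    """P(economics = 1), optionally given h = spy and/or a military value."""
    ms = [0, 1] if military is None else [military]
    numerator = sum(weight(1, m, spy) for m in ms)
    denominator = sum(weight(e, m, spy) for e in (0, 1) for m in ms)
    return numerator / denominator


# Marginally, learning the military report does not change the belief.
print(p_economics(), p_economics(military=1))
# Given a spy, the military report "explains away" the economic one.
print(p_economics(spy=True), p_economics(spy=True, military=1))
```

With these numbers the belief in the economics report drops once the military report is known to be true, but only when conditioning on h = spy, which is exactly the dependency the induced link has to carry.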

One approach to handle this situation is to first
construct a Bayesian network that represents only a priori
factors that influence the hypotheses, ignoring any
evidential variables (such as gender, badge-wearing, and
limousines). In our example, this network would be
economics → h ← military. Then, use this network to revise
the a priori probabilities of the different hypotheses.
Finally, construct local networks ignoring a priori factors
(as done in Figure 2) and use the resulting multinet with
the revised priors of h to compute the posterior
probability of h as determined by the evidential clues. This
decomposition technique works best if a priori factors
are independent of all clues conditioned on the different
hypotheses, that is, in situations that can be modeled
with Bayesian networks of the form shown in Figure 4
where all paths between a priori factors $r_i$’s and
evidential clues $f_i$’s pass through h.
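
The two stages of this decomposition can be sketched as follows, assuming the a priori factors are independent of the clues given h. All tables and numbers are made up for illustration, and the function names are hypothetical.

```python
def revised_prior(cpt, p_factors):
    """Stage 1: P(h) = sum over factor settings r of P(h | r) * P(r)."""
    hypotheses = list(next(iter(cpt.values())))
    return {h: sum(p_factors[r] * cpt[r][h] for r in cpt) for h in hypotheses}


def posterior(prior, likelihood):
    """Stage 2, Eq. (6): P(h | e) = K * P(h) * P(e | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    k = 1.0 / sum(unnormalized.values())
    return {h: k * v for h, v in unnormalized.items()}


# A priori network economics -> h <- military (hypothetical numbers).
p_e = {1: 0.1, 0: 0.9}
p_m = {1: 0.2, 0: 0.8}
p_factors = {(e, m): p_e[e] * p_m[m] for e in (0, 1) for m in (0, 1)}
cpt = {
    (0, 0): {"worker": 0.90, "visitor": 0.09, "spy": 0.01},
    (1, 0): {"worker": 0.65, "visitor": 0.05, "spy": 0.30},
    (0, 1): {"worker": 0.55, "visitor": 0.05, "spy": 0.40},
    (1, 1): {"worker": 0.37, "visitor": 0.03, "spy": 0.60},
}
prior = revised_prior(cpt, p_factors)

# Likelihood of "male wearing a badge", as read off the local networks
# of a multinet built without the a priori factors (made-up values).
likelihood = {"worker": 0.6 * 0.7, "visitor": 0.5 * 0.0, "spy": 0.9 * 1.0}
print(posterior(prior, likelihood))
```

The local networks never mention economics or military, so no induced link between them has to be assessed; their influence enters only through the revised prior.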


Figure 4: A Bayesian network where all paths between a
priori factors $r_i$’s and evidential clues $f_i$’s pass through
h.

When a network of this form cannot serve as a
justifiable model, another approach can be used instead.
First, compose a Bayesian multinet ignoring a priori
factors. Then, construct a Bayesian network from the local
networks by taking the union of all their links (e.g., the
union of all links in Figure 2 yields the Bayesian
network of Figure 1). Finally, add a priori factors to
the resulting network. This approach was proposed in
[Heckerman, 1990].
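
The union step amounts to merging sets of directed links. The sketch below represents each local network as a set of (parent, child) pairs, which is an assumption made for this example; it also does not check that the merged graph is acyclic.

```python
def union_network(local_networks):
    """Union of the directed links of all local networks."""
    links = set()
    for edges in local_networks:
        links |= set(edges)
    return links


# Links of the two local networks of Figure 2, as described in the text.
spy_or_visitor = {("h", "g"), ("h", "b")}
worker = {("g", "b")}

print(sorted(union_network([spy_or_visitor, worker])))
# [('g', 'b'), ('h', 'b'), ('h', 'g')]
```

The result has the three links of Figure 1, so the g-to-b link is now present for every hypothesis even though it matters only for workers.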

The disadvantage of this method is that in the
process of generating a Bayesian network from a multinet,
one encodes asymmetric independence in the
parameters rather than in the topology of the Bayesian network.
Consequently, these asymmetric assertions are not
available to standard inference algorithms to speed up their
computations.

Nevertheless, this approach is still the best
alternative for decomposing the construction of large Bayesian
networks having topologies more complex than that of
Figure 4. Such decomposition techniques are crucially
needed due to the overwhelming details of real-life
problems. Additional issues of knowledge acquisition are
discussed below.

Knowledge  Acquisition/  Representation 
Similarity  Networks 

Recall the guard that must distinguish between workers,
executives, visitors and spies. In this story, some
variables do not help distinguish between certain hypotheses.
For example, gender and badges do not help distinguish
between workers and executives, and limousines do not
help distinguish between spies and visitors. In richer
domains, large numbers of variables are often not relevant
for distinguishing between certain hypotheses.

Unfortunately, the Bayesian multinet approach
requires full specification of all variables in each local
network even when they are not relevant to distinguishing
between the hypotheses associated with that local network.
For example, the relationship between b and g is encoded
in the local network for spies vs. visitors although these
variables do not help distinguish between this pair of
hypotheses (Figure 3). Assessing such relationships, in
contexts where they are not relevant, imposes an
insurmountable burden on the expert consulted, as is demonstrated
by the following quote [Heckerman, 1990]:

“When the expert pathologist was asked questions
of the form

  Given any disease, does observing feature x
  change your belief that you will observe feature
  y?

the expert sometimes would reply

  I’ve never thought about these two features at
  the same time before. Feature x is relevant to
  only one set of diseases, while feature y is only
  relevant to another set of diseases. These sets
  of diseases do not overlap, and I never confuse
  the first set of diseases with the second.”

The solution is simply to include in each local network
only those variables that are relevant for distinguishing
between the hypotheses covered by that local network.

However,  by  doing  so,  valuable  information  for  cor¬ 
rect  identification  might  be  lost.  For  example,  the  rela¬ 
tionships  between  badge-wearing  and  gender  in  Figure  3 
would  be  lost.  To  compensate  for  such  losses  of  informa¬ 
tion,  additional  local  networks  must  be  constructed. 

For  example,  the  secured-building  can  be  represented 
with  three  local  networks  shown  in  Figure  5  rather  than 
two  as  in  Figure  3.  One  network  is  used  to  distinguish 
between  spies  and  visitors,  another  between  visitors  and 
workers,  and  a  third  between  workers  and  executives.  In 
each  local  network  we  include  only  those  variables  rel¬ 
evant  to  distinguishing  the  hypotheses  covered  by  that 
local network. In particular, the relationship between
badge-wearing and gender is not included in the local
network for workers vs. executives, as it was in Figure 3.
This relationship, however, is included in the local network
for visitors vs. workers because it helps distinguish between
these two hypotheses. The reason for not losing
needed information is that the three local networks are
based on a connected cover of hypotheses (rather than a
partition).


Figure  5:  A  similarity  network  representation  of  the 
secured-building  story. 

Definition A cover of a set $A$ is a collection $\{A_1, \ldots, A_k\}$
of non-empty subsets of $A$ whose union is $A$. Each cover
is a hypergraph, called the similarity hypergraph, where
the $A_i$'s are edges and elements of $A$ are nodes. A cover
is connected if the similarity hypergraph is connected.

In Figure 5, {{spy, visitor}, {visitor, worker}, {worker,
executive}} is a cover of the hypotheses set. This cover is
connected because it is simply a four-node chain
spy - visitor - worker - executive which, by definition, is a
connected hypergraph. The set {{spy, visitor}, {worker,
executive}} is also a cover but it is not connected. The
set {{worker, executive, visitor}, {visitor, spy}} is an
example of a connected cover that is a hypergraph which
is not a graph.
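
The two properties in the definition (being a cover, and being connected) can be checked mechanically. The following is a minimal sketch, not part of the original paper; the function names `is_cover` and `is_connected_cover` are hypothetical, and each edge is assumed to be given as a Python set of hypothesis labels.

```python
def is_cover(edges, universe):
    """True if the edges are non-empty subsets of `universe` whose union is `universe`."""
    universe = set(universe)
    edges = [set(e) for e in edges]
    if not edges or any(not e or not e <= universe for e in edges):
        return False
    return set().union(*edges) == universe


def is_connected_cover(edges, universe):
    """True if `edges` is a cover and its similarity hypergraph is connected."""
    if not is_cover(edges, universe):
        return False
    edges = [set(e) for e in edges]
    reached = set(edges[0])      # nodes reachable from the first edge
    remaining = edges[1:]        # edges not yet attached to `reached`
    grew = True
    while grew and remaining:
        grew = False
        for e in list(remaining):
            if e & reached:      # edge shares a node with the reached part
                reached |= e
                remaining.remove(e)
                grew = True
    return not remaining


hypotheses = {"spy", "visitor", "worker", "executive"}
chain = [{"spy", "visitor"}, {"visitor", "worker"}, {"worker", "executive"}]
split = [{"spy", "visitor"}, {"worker", "executive"}]
hyper = [{"worker", "executive", "visitor"}, {"visitor", "spy"}]
```

Under these assumptions, `chain` and `hyper` are connected covers, while `split` is a cover that is not connected, matching the three examples in the text.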

Definition Let $U = \{u_1, \ldots, u_n\}$ be a finite set of variables
each having a finite set of values, $P$ be a probability
distribution having the cross product of these sets of values
as its sample space, and $h$ be a distinguished variable
among the $u_i$'s that represents a mutually-exclusive and
exhaustive set of hypotheses. Let $A_1, \ldots, A_k$ be a connected
cover of the values of $h$. A directed acyclic graph
$D_i$ is called a comprehensive local network of $P$ (associated
with $A_i$) if it is a Bayesian network of $P$ assuming
one of the hypotheses in $A_i$ holds, i.e., $D_i$ is a Bayesian
network of $P(u_1 \ldots u_n \mid A_i)$. The network obtained from
$D_i$ by removing nodes that are not relevant to distinguishing
between hypotheses in $A_i$ is called an ordinary
local network. The set of $k$ ordinary local networks is
called an (ordinary) similarity network of $P$.

For example, the local networks of Figure 5 are ordinary,
and together form an ordinary similarity network.
Notably, hypotheses covered by each local network are
often similar (e.g., spies and visitors),^4 a choice that
maximizes the number of asymmetric independence
relationships encoded.

Heckerman (1990) shows that under several assumptions,
if a cover is connected, one can always remove from
each local network variables that do not help distinguish
between hypotheses covered by that local network and
yet not lose the information necessary for representing
the full joint distribution. These assumptions are that
1) the hypothesis variable is a root node, 2) the cover
is a graph and not a hypergraph, 3) the local networks
are constrained by the same partial order, and 4) the
distribution is strictly positive. These assumptions are
relaxed below.

Theorem 2 Let $\{u_1, \ldots, u_n\}$ be a finite set of variables
each having a finite set of values, $P$ be a probability
distribution having the Cartesian product of these sets of
values as its sample space, $h$ be a distinguished variable
among the $u_i$'s, and $S$ be a similarity network of $P$.
Then, the posterior probability of every hypothesis given
any value combination for the variables in $\{u_1, \ldots, u_n\}$
can be computed from the parameters encoded in $S$
provided $p(h_i) \neq 0$ for every value $h_i$ of $h$.

To  prove  the  above  theorem,  it  suffices  to  consider 
the  case  where  h  is  a  root  node  in  all  the  local  net¬ 
works  of  S  because,  otherwise,  arc-reversal  transforma¬ 
tions  [Shachter  1986]  can  be  applied  until  h  becomes  one. 

Also note that since the similarity hypergraph is connected,
it imposes $n - 1$ independent equations among
the following $n$: $p(h_i) = p(h_i \mid A_i) \cdot p(A_i)$, $i = 1 \ldots n$.
In addition, $\sum_i p(h_i) = 1$. The values for $p(h_i)$
are the unique solution of these linear equations provided
$p(h_i) \neq 0$ for $i = 1 \ldots n$.
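
One way to see why connectedness pins down the priors is to propagate probability ratios from edge to edge and then normalize. The sketch below is an illustration only, not the authors' procedure: it assumes each local network supplies the conditional priors $p(h \mid A_j)$ for the hypotheses in its edge $A_j$, that all of them are nonzero, and the function name `recover_priors` and all numbers are hypothetical.

```python
def recover_priors(local_priors):
    """Recover p(h) from a list of dicts, one per edge A_j, mapping h -> p(h | A_j).

    Within an edge, p(h) is proportional to p(h | A_j); overlapping edges
    fix the relative scale between edges, and normalization does the rest.
    """
    pending = [dict(edge) for edge in local_priors]
    weights = dict(pending.pop(0))        # unnormalized p(h), first edge as reference
    progress = True
    while pending and progress:
        progress = False
        for edge in list(pending):
            shared = [h for h in edge if h in weights]
            if not shared:
                continue                  # edge not yet linked to the processed part
            anchor = shared[0]
            scale = weights[anchor] / edge[anchor]
            for h, p in edge.items():
                weights.setdefault(h, p * scale)
            pending.remove(edge)
            progress = True
    if pending:
        raise ValueError("the cover is not connected")
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Hypothetical conditional priors for the chain spy - visitor - worker - executive.
edges = [
    {"spy": 0.25, "visitor": 0.75},
    {"visitor": 0.2, "worker": 0.8},
    {"worker": 0.75, "executive": 0.25},
]
priors = recover_priors(edges)
```

With these made-up inputs the recovered priors are 0.05, 0.15, 0.6 and 0.2 for spy, visitor, worker and executive; a disconnected cover raises an error because the relative scale between its components is undetermined.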

Aside from the priors, the only remaining parameters
needed for computing the posterior probability of each
hypothesis $h_i$ are $p(v_2 \ldots v_n \mid h_i)$ where $v_2 \ldots v_n$ are
arbitrary values of $u_2 \ldots u_n$ (assuming without loss of
generality that $h = u_1$). Due to the chaining rule,
$p(v_2 \ldots v_n \mid h_i)$ can be factored as follows:

$$p(v_2 \ldots v_n \mid h_i) = p(v_2 \mid h_i) \cdot p(v_3 \mid v_2\, h_i) \cdots p(v_n \mid v_2 \ldots v_{n-1}\, h_i).$$

Thus, it suffices to show that for each variable $u_j$,
$p(v_j \mid v_2 \ldots v_{j-1}\, h_i)$ can be computed from the parameters
encoded in $S$.


Let $D_i$ denote a local network in $S$, $A_i$ be the hypotheses
associated with $D_i$, and $h_i$ be a hypothesis in
$A_i$. There are two cases: either $u_j$ is depicted in $D_i$ or
it is not. Let $A_i, A_{i+1}, \ldots, A_m$ be a path in the similarity
hypergraph where $A_m$ is the only edge on this path associated
with a local network that depicts $u_j$ as a node. If
$u_j$ is depicted in $D_i$ then the path consists of one edge
$A_i$ which is equal to $A_m$. If $u_j$ is not depicted in any
local network, then $u_j$ does not alter the posterior probability
of any hypothesis and is therefore omitted from
the computations.

Let $D_k$ be the local network associated with $A_k$ for
$k = i + 1 \ldots m$ and let $h_{i+1}, h_{i+2}, \ldots, h_m$ be a sequence
of hypotheses such that $h_k \in A_{k-1} \cap A_k$. Due to the
definition of similarity networks, since $u_j$ is not depicted
in $D_k$ where $k < m$, the following equality must hold:

$$p(v_j \mid v_2 \ldots v_{j-1}\, h_{k-1}) = p(v_j \mid v_2 \ldots v_{j-1}\, h_k).$$

Since this equation holds for every $k$ between $i + 1$ and
$m$, we obtain

$$p(v_j \mid v_2 \ldots v_{j-1}\, h_i) = p(v_j \mid v_2 \ldots v_{j-1}\, h_m).$$

Moreover,

$$p(v_j \mid v_2 \ldots v_{j-1}\, h_m) = p(v_j \mid v'_1 \ldots v'_l\, h_m)$$

where $u'_1 \ldots u'_l$ are the variables depicted in $D_m$ (a subset
of $\{u_2 \ldots u_{j-1}\}$) because, due to the definition of
similarity network, the variables deleted are conditionally
independent of $u_j$, given the other variables; they are
disconnected from all the other variables in $D_m$.^3

Finally,

$$p(v_j \mid v'_1 \ldots v'_l\, h_m) = p(v_j \mid v'_1 \ldots v'_l\, h_m, A_m),$$

because $h_m$ logically implies the disjunction over all
hypotheses in $A_m$.

The latter probability is computable from the local
network $D_m$ by any standard algorithm (e.g.,
[Pearl, 1988]); thus, due to the three equalities above,
$p(v_j \mid v_2 \ldots v_{j-1}\, h_i)$ is also computable as needed. □

For example, to compute $P(g, b, l \mid \mathrm{spy})$ we use the following
two equalities implied by Figure 5: From the
first local network, $P(g, b, l \mid \mathrm{spy}) = P(g \mid \mathrm{spy}) \cdot P(b \mid \mathrm{spy}) \cdot P(l \mid \mathrm{spy})$,
and from the absence of $l$ in the first and
second local networks, $P(l \mid \mathrm{spy}) = P(l \mid \mathrm{worker})$. Thus,
$P(g, b, l \mid \mathrm{spy}) = P(g \mid \mathrm{spy}) \cdot P(b \mid \mathrm{spy}) \cdot P(l \mid \mathrm{worker})$, where
all the needed probabilities are encoded in the similarity
network. In fact, the proof of Theorem 2 provides
a general way of factoring any desired probability; thus,
the full joint distribution $P(g, b, l, h)$ is encoded in the
ordinary similarity network of Figure 5.

Similarity networks have another important advantage
not mentioned so far: protecting the model builder from
omitting relevant clues. For example, suppose workers
and executives often arrive with a smile to work (because
the secured building is such a great place to be in) while
spies and visitors arrive seriously. Such a clue, smile, is
likely to be forgotten when constructing the local networks
for spies vs. visitors and for workers vs. executives
because it does not help distinguish between these pairs
of hypotheses. However, when constructing the similarity
network of Figure 5, which includes a local network
for distinguishing visitors from workers, smile is more
likely to be recalled because the distinctions between
visitors and workers are explicitly in focus.

^3 Geiger and Heckerman (1990) discuss weaker definitions
of being irrelevant other than being disconnected.

^4 Hence the name: similarity network.

Redundancy 

Basing  the  construction  of  local  networks  on  covers  of 
hypotheses  raises  the  problem  of  redundancy,  namely, 
that some parameters are specified in more than one local
network. For example, in Figure 5, the parameter
$P(g \mid \mathrm{visitor})$ should, in principle, be specified both in the
first and in the second local network. This problem is
particularly  crucial  because  local  networks  are  actually 
constructed  from  expert’s  judgments  rather  than  from  a 
coherent  probability  distribution  as  implied  by  the  defi¬ 
nition  of  similarity  networks. 

One way to remove redundancy is to automatically
translate a similarity network, as it is being constructed,
to a Bayesian multinet, which is never redundant. For
example,  instead  of  storing  Figure  5,  we  can  actually 
store  Figure  3  which  contains  no  redundant  information. 
The  translation  is  done  by  the  following  algorithm. 

Conversion  Algorithm 

Input: A similarity network S of a probability distribution P.

Output: A Bayesian multinet of P.

1. For each ordinary local network L in S:

• Add a node for each variable not represented in L.

• For each added node x, set the parents of x in L
to be the union of all parents of x in all other local
networks where x originally appeared, excluding
variables that were originally in L.

2. Remove enough local networks from S and enough
hypotheses from the remaining local networks until a
Bayesian multinet is obtained.

(A  finer  version  of  this  algorithm  is  forthcoming). 
Notably, the user of a similarity network need not
know about the conversion to a Bayesian multinet, which
can be thought of as an internal representation. The
user  benefits  from  both  the  advantages  of  similarity  net¬ 
work  for  knowledge  acquisition,  and  from  an  inference 
algorithm  (Section  2)  that  uses  the  Bayesian  multinet 
produced  by  the  conversion  algorithm. 
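
Step 1 of the conversion algorithm can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: a local network is assumed to be a dictionary mapping each depicted variable to its set of parents, the function name `complete_local_networks` is hypothetical, and the two toy networks are invented for the example (they are not the networks of Figure 5). Step 2 is not sketched because the text leaves its details open.

```python
def complete_local_networks(similarity_network):
    """Step 1: add to every local network the variables it does not depict.

    An added node x receives as parents the union of x's parents in the
    other local networks that depict x, minus the variables that were
    originally in the network being completed.
    """
    all_vars = set()
    for net in similarity_network:
        all_vars |= set(net)
    completed = []
    for idx, net in enumerate(similarity_network):
        original = set(net)
        new_net = {v: set(parents) for v, parents in net.items()}
        for x in sorted(all_vars - original):
            parents = set()
            for jdx, other in enumerate(similarity_network):
                if jdx != idx and x in other:
                    parents |= other[x]
            new_net[x] = parents - original
        completed.append(new_net)
    return completed


# Hypothetical local networks: variable -> set of parents.
net_spy_visitor = {"h": set(), "g": {"h"}, "b": {"h", "g"}}
net_worker_exec = {"h": set(), "l": {"h"}}
full_sv, full_we = complete_local_networks([net_spy_visitor, net_worker_exec])
```

In this toy case the completed second network regains the arc from g to b but no arc from h to either, which is the kind of hypothesis-specific independence a multinet is meant to record.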

Generalized  Similarity  Networks 

Previous sections assume all hypotheses are mutually
exclusive and are, therefore, represented as values of a
single hypothesis variable denoted h. Here this assumption
is relaxed. We allow several variables to represent
hypotheses, as needed by the following example:

Consider  the  guard  of  Section  2  who  has  to  distin¬ 
guish  between  workers,  visitors,  and  spies.  A  pair  of 
people  approach  the  building  and  the  guard  tries  to 
classify  them  as  they  approach.  Assume  that  only 
workers  converse  (c)  and  that  workers  often  arrive 
with  other  workers  (because  they  must  car-pool  to 
conserve  energy). 

A Bayesian network representing this situation is
shown in Figure 6 where nodes $h_1$ and $h_2$ stand for the
respective identity of the two persons. (The direction of
the link between $h_1$ and $h_2$ is arbitrary.)


Figure 6: A Bayesian network with two hypothesis nodes
$h_1$ and $h_2$.


Alternatively,  we  can  represent  this  example  using  a 
generalized  similarity  network,  or  a  generalized  Bayesian 
multinet. 

Definition Let $\{u_1, \ldots, u_n\}$ be a finite set of variables
each having a finite set of values, $P$ be a probability
distribution having the cross product of these sets
of values as its sample space, and $H$ be a subset of
distinguished variables among the $u_i$'s each representing
a set of hypotheses. Denote the Cartesian product
of the sets of values of the distinguished variables
by domain(H). Let $A_1, \ldots, A_k$ be a connected cover of
domain(H). A directed acyclic graph $D_i$ is called a comprehensive
local network of $P$ if it is a Bayesian network
of $P(u_1 \ldots u_n \mid A_i)$. The network obtained from $D_i$ by
removing nodes that are not relevant to distinguishing
between hypotheses in $A_i$ is called an ordinary local network.
The set of $k$ local networks is called a generalized
similarity network of $P$. When $A_1, \ldots, A_k$ is a partition
of domain(H), then the set of $k$ comprehensive local networks
is called a generalized Bayesian multinet.

For example, the secured-building story is represented
in the generalized similarity network of Figure 7. Note,
$H = \{h_1, h_2\}$ and domain(H) consists of nine elements
$(x, y)$ where both $x$ and $y$ are drawn from the set
$\{w, v, s\}$. A connected cover of domain(H) upon which
Figure 7 is based consists of: $\{(s, s), (v, s), (s, v), (v, v)\}$,
$\{(v, v), (w, v), (v, w), (w, w)\}$, and $\{(s, s), (s, w), (w, s)\}$.
This cover is connected.

Most asymmetric independence assertions encoded in
Figure 7 were either explained in previous sections or are
obvious from the verbal description of the story.

Figure 7: A generalized similarity network with two
hypothesis nodes.


The absence of a link between $h_1$ and $h_2$ in the top
network encodes the fact that if the guard knew that one
person is a spy, this knowledge would not help him/her
decide whether the other person is a spy or a visitor.
The existence of a link between $h_1$ and $h_2$ in the middle
network encodes the fact that workers come in pairs
more often than do visitors. Hence the knowledge that
one person is a worker is a clue for classifying the other
person.

The vanishing dependence between hypothesis variables
$h_1$ and $h_2$ in the case of spies vs. visitors is an example
of inter-hypothesis independence. Such asymmetric
assertions cannot be encoded in ordinary similarity
networks.


Summary 

This  paper  proposes  an  efficient  format  for  encoding  and 
using  asymmetric  independence  assertions  for  inference. 
The  model  builder  is  asked  to  express  knowledge  about 
independence  by  constructing  multiple  local  networks 
using  informal  guidelines  of  causation  and  time  ordering. 


Like  any  Bayesian  network,  local  networks  possess  pre¬ 
cise  semantics  in  terms  of  independence  assertions  and 
these  can  be  used  to  verify  1)  whether  the  network  faith¬ 
fully  represents  the  domain  and  2)  whether  the  input  is 
consistent. 

Multiple local networks have several advantages compared
to a single Bayesian network. The elicitation of
several small networks is easier than eliciting a single
full-scale Bayesian network because the expert can focus
his/her attention on particular subdomains and, hence,
provide more reliable judgments. Multiple networks represent
a domain better because more knowledge about
independence is qualitatively encoded. Algorithms for
finding the most likely hypothesis run faster when using
multiple networks. And finally, the overall storage requirement
of multiple networks is often smaller than that
of a single Bayesian network because, as independence
assertions become more detailed, fewer numeric parameters
are needed for describing a domain.

Notably,  when  independence  assertions  in  the  domain 
are  symmetric,  a  single  Bayesian  network  is  preferable. 

The remaining challenges are to (1) devise additional
graphical representation schemes of salient patterns of
independence assertions, (2) provide computer-aided elicitation
procedures for constructing these representations, and
(3) devise efficient inference procedures that make use
of the encoded assertions.

1  Introduction

In  a  multivariate  Gaussian  model,  the  presence  of  a  zero 
in  the  inverse  variance  matrix,  or  in  the  partial  corre¬ 
lation  matrix,  implies  that  the  two  variables  are  inde¬ 
pendent  given  the  rest.  Thus  the  dependence  between 
variables  can  be  fully  represented  by  a  graph,  in  which 
the  absence  of  an  edge  implies  conditional  independence. 
This  leads  to  the  term  graphical  Gaussian  model,  and  fur¬ 
ther  to  theorems  concerning  the  equivalence  of  the  local, 
global  and  pairwise  Markov  properties  of  the  graphical 
model.  For  discrete  distributions  (or  other  multivariate 
continuous  distributions),  this  graphical  representation  is 
ambiguous,  as  the  interactions  may  involve  more  than 
two variables at a time. By convention, the presence of
a clique of $k$ variables in a graph representing a cross-classified
multinomial distribution implies that the joint
distribution includes a term in all $k$ variables. The distribution
does not in general factorize into $\binom{k}{2}$ pairwise
components. However, a hypergraph gives a natural,
unambiguous representation.

A  hypergraph  comprises  a  set  of  nodes  (or  variables) 
together  with  a  set  of  hyperedges.  Each  hyperedge  is 
a  subset  of  the  set  of  nodes,  with  the  constraints  that 
no hyperedge is the empty set ($\emptyset$), and the union of all
hyperedges  is  the  set  of  nodes.  Thus,  the  presence  of  a 
given  hyperedge  implies  a  corresponding  factor,  involving 
one  or  more  variables. 

To demonstrate the flexibility and utility of hypergraphs,
we consider hypergraph representations of graphical
association/conditional Gaussian (CG) models for
both discrete and continuous variables (Lauritzen and
Wermuth, 1989), and their generalization to hierarchical
interaction/CG models (Edwards 1990). Edwards (1990,
p.5) gives the example of two discrete and two Gaussian
variables and draws the independence graph for the model
in which the two discrete variables are conditionally independent,
and likewise the two continuous variables. With
other models for these four variables the graphical representation
breaks down, but the hypergraph representation
does not. CG models provide some of the most
exciting applications of graphical modeling; we focus on
the special case of ANOVA, allowing heterogeneous variances.

The  Gibbs-Markov  equivalence  says  that  if  a  strictly 
positive  distribution  satisfies  the  conditional  indepen¬ 
dences  induced  by  a  graph  (through  graph  separation), 
then  this  distribution  is  the  product  of  functions  carried 
by  the  cliques  of  the  graph,  and  vice-versa.  Later  in  the 
paper  we  will  reformulate  this  equivalence  in  terms  of  hy¬ 
pergraphs  as  follows.  We  will  replace  the  conditional  in¬ 
dependences  with  their  equivalent  factorizations  into  two 
factors  (2-factorizations),  and  we  will  introduce  the  meet 
operation  on  hypergraphs,  which  will  allow  us  to  com¬ 
bine  several  2-factorizations  into  one  factorization  with 
n  >  2  factors.  This  has  two  advantages:  First,  it  will  show 
that  only  certain  factorizations  can  be  described  through 
the  conditional  independences  they  induce.  Second,  us¬ 
ing  methods  from  the  theory  of  relational  databases,  we 
give  conditions  that  generalize  the  equivalence  in  a  weaker 
form  to  distributions  that  are  not  strictly  positive. 


2  Conditional  Gaussian  models 

The conditional Gaussian model (Edwards, 1990; Whittaker,
1990) specifies the joint distribution of a set $I$ comprising
$k$ discrete variables and a set $Y$ comprising $q$ continuous
variables to be

$$f_{IY}(i, y) = f_I(i)\, f_{Y|I}(y \mid i), \qquad (1)$$

the product of a cross-classified multinomial distribution
$f_I$ and a multivariate Gaussian density $f_{Y|I}$ separately in
each cell $i$. The moment parametrization of (1) is

$$f_{IY}(i, y) = f_I(i)\, |2\pi\Sigma_i|^{-1/2} \exp\left\{-\tfrac{1}{2}(y - \mu_i)^T \Sigma_i^{-1} (y - \mu_i)\right\} \qquad (2)$$

and the canonical parametrization is

$$f_{IY}(i, y) = \exp\left\{\alpha_i + \beta_i^T y - \tfrac{1}{2}\, y^T D_i\, y\right\} \qquad (3)$$

where each scalar parameter included in $(\alpha_i, \beta_i, D_i)$ is
expanded using all subsets of $I$,

$$\alpha_i = \sum_{a \subseteq I} \lambda_i^a, \qquad \beta_i = \sum_{a \subseteq I} \eta_i^a, \qquad D_i = \sum_{a \subseteq I} \psi_i^a. \qquad (4)$$

These are called the discrete, linear and quadratic parts
respectively. Models are specified by restricting some elements
of the $\lambda_i^a$, $\eta_i^a$, and $\psi_i^a$ to zero.

Lauritzen  and  Wermuth  (1989)  develop  CG  graphical 
association  models,  for  which  a  graphical  representation 
suffices  (with  the  clique  convention  for  discrete  variables). 
The  attractiveness  of  a  graphical  association  model  is 
that  the  complete  set  of  conditional  independence  state¬ 
ments  can  be  read  from  the  graph  (hence  the  term  inde¬ 
pendence  graph).  The  model  is  fully  specified  by  these 
ternary  statements  of  the  conditional  independence  of 
pairs  of  variables  given  the  rest.  However,  graphical  as¬ 
sociation  models  are  unnecessarily  restrictive.  Edwards 
(1990)  gives  the  theoretical  basis  for  the  analysis  of  con¬ 
ditional  independence  in  hierarchical  interaction  models. 
He defines the hierarchical interaction models to be the
CG models that satisfy the marginality principle. Briefly,
if $\lambda_i^a$ is not identically zero, then neither is each of $\lambda_i^b$ for
$b \subseteq a$. If the $r$th element $\eta_{ir}^a$ of $\eta_i^a$ is not identically zero
then neither are $\eta_{ir}^b$ and $\lambda_i^b$ for $b \subseteq a$. If the $rs$th element
$\psi_{irs}^a$ of $\psi_i^a$ is nonzero, then neither are $\psi_{irr}^b$, $\psi_{irs}^b$, $\psi_{iss}^b$,
$\eta_{ir}^b$, $\eta_{is}^b$, and $\lambda_i^b$ for $b \subseteq a$.


Hypergraph  representation  of  hierarchical  inter¬ 
action  models.  We  demonstrate  that  hierarchical  in¬ 
teraction  models  can  be  represented  using  hypergraphs 
(although,  as  will  be  seen,  the  quadratic  parts  cause  some 
difficulties).  In  Section  4  we  carry  across  some  of  the  ba¬ 
sic  properties  of  independence  graphs  to  hypergraphs.  In 
so doing, we argue that it is better to emphasize factorizations,
read directly from the set of hyperedges, rather than
conditional independence statements. In particular, in
modeling  data  by  ANOVA  it  is  natural  to  think  in  terms 
of  several  overlapping  subsets  of  mutually  dependent  vari¬ 
ables,  each  a  hyperedge. 

The marginality principle allows us to use the reduced
hypergraph, which includes a hyperedge corresponding to
each maximal subset of variables. The hierarchical interaction
model is especially demanding, and requires that
there are two types of hyperedges. A type 1 hyperedge
(Figure 1) corresponds to a maximal discrete or linear
part in (4). A type 1 hyperedge containing only discrete
variables corresponds to some $\lambda_i^a$ with $a$ maximal. A
type 1 hyperedge containing $k'$ discrete and $q'$ continuous
variables corresponds to a $q' \le q$ subvector of $\eta_i^a$ where
$|a| = k'$. By convention, the presence of two or more
continuous variables in a type 1 edge does not imply an
association between them. A type 2 hyperedge (Figure 2)
also includes both discrete and continuous variables, and
corresponds to a maximal quadratic part $\psi_i^a$. When there
is more than one continuous variable in a type 2 edge, the
pairwise interactions are implied. Type 2 edges must be
nested inside type 1 edges, and where type 1 and type 2
edges coincide, the type 1 edge may be omitted.

If  the  marginality  principle  were  to  be  dropped,  two 
types  of  edges  would  suffice,  but  the  nesting  property 
would  fail,  and  a  reduced  hypergraph  could  not  be  used. 

3  Analysis  of  variance 

To illustrate the hypergraph representation, and to motivate
the use of hierarchical interaction models, we turn
to the CG regression setting, that is, the conditional part
$f_{Y|I}(y \mid i)$ keeping $f_I(i)$ fixed. With a single continuous
variable $Y$ and $k$ factors, this is the analysis of variance
model. Our principal concerns have a practical flavor:

1.  The set of possible models includes the lattice of $2^k$
models for the linear part, each multiplied by some
number of models for the quadratic part. How is
backward or forward model selection to be viewed as
“local operations” on hyperedges?

2.  Graphical models, fit by maximum-likelihood, are
commonly compared using the analysis of deviance.
How adequate are the approximations when exact
F-ratios are available?

3.  Hierarchical interaction models allow unequal variances
to be modeled readily. How important is this
feature?

4.  In the classical approach to ANOVA, the experimental
design places restrictions on the models to be selected.
What is the analog for graphical models?

5.  What useful information is contained in the independence
structure of $f_I(i)$?

Example 1. Pilot plant data. Box, Hunter and
Hunter (1978) give pilot plant data of chemical yield $Y$
measured at two replicates of a $2^3$ design, with factors
temperature T, concentration C, and catalyst K. Conventional
ANOVA (Table 1) shows that T, C, and TK
are significant at 1%.

Figure 1: ANOVA of pilot plant data using hypergraphs


Table 1. Pilot plant ANOVA

Source   DF     SS   F-ratio   P-value
T         1   2116       265        0%
C         1    100        13        1%
K         1      9         1       32%
TC        1      9         1       32%
TK        1    400        50     0.01%
CK        1      0         0      100%
TCK       1      1       0.1       73%
Error     8     64
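
The F-ratio column of Table 1 can be reproduced from the DF and SS columns, assuming the usual fixed-effects definition in which each effect mean square is divided by the error mean square. The short sketch below uses only the tabulated values; the function name `f_ratios` is hypothetical.

```python
def f_ratios(effects, error_df, error_ss):
    """Return effect -> F-ratio, where F = (SS / DF) / (error SS / error DF)."""
    error_ms = error_ss / error_df
    return {name: (ss / df) / error_ms for name, (df, ss) in effects.items()}


# (DF, SS) pairs copied from Table 1.
pilot_plant = {
    "T": (1, 2116), "C": (1, 100), "K": (1, 9), "TC": (1, 9),
    "TK": (1, 400), "CK": (1, 0), "TCK": (1, 1),
}
ratios = f_ratios(pilot_plant, error_df=8, error_ss=64)
```

The unrounded ratios are 264.5, 12.5, 1.125, 1.125, 50, 0 and 0.125, which agree with the tabulated F-ratios after rounding.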

We model these data using hypergraphs, with constant
variance ($\psi_i^a = \psi$ for $a = \emptyset$, $\psi_i^a = 0$ otherwise). Each hyperedge
includes $Y$ and some subset of T, C, and K. We always
include the complete model for the discrete variables, that
is, the hyperedge {T, C, K}, so that the deviance in comparing
the $f_{IY}(i, y)$ is a comparison of the $f_{Y|I}(y \mid i)$.

Backward  elimination  algorithm. 


1.  Start with the hypergraph containing the single maximal
edge $\{Y, I\}$.

2.  Replace in turn each maximal hyperedge, containing
$k'$ discrete variables and $Y$, with $k'$ hyperedges each
containing $k' - 1$ discrete variables and $Y$. (This
eliminates the $k'$-factor interaction.)

3.  Choose  the  model  at  step  2  with  the  minimum  de¬ 
viance  difference.  If  that  difference  is  statistically 
significant,  stop.  Otherwise,  reduce  the  resulting  hy¬ 
pergraph  and  return  to  step  2. 
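
The edge replacement in step 2 and the reduction in step 3 are purely set operations, which the following sketch illustrates. It is an illustration only: hyperedges are assumed to be frozensets of single-letter variable names, the function names `reduce_hypergraph` and `eliminate` are hypothetical, the deviance test that chooses which edge to eliminate is not modeled, and the always-present discrete hyperedge {T, C, K} is left out for brevity.

```python
def reduce_hypergraph(edges):
    """Keep only the maximal hyperedges (drop duplicates and proper subsets)."""
    unique = {frozenset(e) for e in edges}
    return {e for e in unique if not any(e < other for other in unique)}


def eliminate(edges, edge, response="Y"):
    """Replace `edge` (k' discrete variables plus the response) by the k'
    hyperedges that each omit one discrete variable, then reduce."""
    edge = frozenset(edge)
    discrete = edge - {response}
    children = {edge - {d} for d in discrete}
    rest = {frozenset(e) for e in edges} - {edge}
    return reduce_hypergraph(rest | children)


# The sequence of models described for the pilot plant data.
start = {frozenset("YTCK")}
no_tck = eliminate(start, "YTCK")   # drop the TCK interaction with Y
no_ck = eliminate(no_tck, "YCK")    # drop the CK interaction and reduce
no_tc = eliminate(no_ck, "YTC")     # drop the TC interaction and reduce
```

Under these assumptions the last model has hyperedges {Y, T, K} and {Y, C}, so the K main effect stays in the model through the retained TK interaction.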

Modifications are available when $q > 1$, and for forward
selection, or a stepwise procedure. The backward elimination
algorithm can be implemented using a graphical user
interface (see Figure 1), and then the use of hypergraphs
would eliminate the need for difficult modeling formulae
(Edwards (1990) and Whittaker (1990)).

Figure 1 shows backward elimination for the pilot
plant data. The three steps are (1) UL to UR: eliminate
TCK interaction (with Y); (2) UR to LL: eliminate CK
interaction and reduce; (3) LL to LR: eliminate TC interaction
and reduce. Notice that the hyperedge TCK
is shown at UR, but is implicitly included in the models
at LL and LR also. The arrows are annotated with the
P-values from both the analysis of deviance and F-ratios
(Table 1). The two sets of P-values are reasonably close.
Each model is also referred to the original model ($\chi^2_2$ and
$\chi^2_3$ respectively). As the design is balanced and complete,
independent  tests  of  the  two-way  interactions  CK  and 
TC  are  available  from  conventional  ANOVA  (Table  1). 
Modeling  with  hypergraphs  leads  to  a  hierarchical  inter¬ 
action  model  that  includes  the  K  main  effect  (contrary 
to Table 1). Notice that because the hyperedge TCK is
implicitly present, we may not conclude from Figure 1,
LR that $C \perp K \mid Y$.

Example  2.  Dental  golds  data.  Hoaglin,  Mosteller 
and  Tukey  (1991)  and  Goodall  and  De  Veaux  (1990)  in¬ 
clude  extensive  analyses  of  data  on  the  hardness  Y  of 
dental  gold,  produced  using  three  methods  M  at  three 
temperatures  T  from  two  alloys  A  by  five  dentists  D.  A 
model  for  the  linear  part  is  {YMTD,  YMAD}.  Figure  2 
shows hypergraphs with four choices of quadratic part.
The  analysis  of  deviance  is  given  in  Table  2.  The  linear 
part  may  be  refined  further.  Each  choice  in  Table  2  is 
permissible  with  linear  part  {YMTD,  YMA}  but  the 
fourth  is  not  with  {YTD,YMA}. 


Figure  2:  Models  for  heterogeneous  variances  in  dental 
golds  data 


Table 2. Analysis of deviance for dental golds
(The P-value compares each deviance to the first, homogeneous model)

Quadratic   Deviance   Deg. Free   P-value
Y           674        119         -
YM          671        117         27%
YD          666        115         9%
YMD         651        105         6%

Role of experimental design. Just as in the classical approach to ANOVA, the experimental design limits the models that can be considered. A k-factor factorial design without replication includes no (k + 1)-vertex hyperedge. Model selection by backwards elimination in the dental golds example begins with four three-factor interactions.

When there is confounding we assume that some terms, typically the high order interactions, are zero. For example, the resolution V $2^{5-1}$ fractional factorial with defining relation I = 12345 confounds 1 and 2345, 12 and 345, etc. The maximal model includes $\binom{5}{2}$ three-vertex hyperedges (a factorization). The resolution IV design with I = 1234 confounds 1 and 234, 12 and 34, etc. Setting the three-way interactions and three two-way interactions to zero leaves four maximal models each with three two-way interactions.
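The confounding pattern quoted above follows from multiplying an effect by the defining word and cancelling squared factors. The following is a minimal sketch of that rule, assuming single-character factor labels; the function name `alias` is illustrative and not taken from the paper.

```python
def alias(effect: str, defining_word: str) -> str:
    """Return the effect confounded with `effect` when I = defining_word.

    Multiplying two words cancels every factor that appears twice, which is
    the symmetric difference of the two sets of factor labels.
    """
    product = set(effect) ^ set(defining_word)
    return "".join(sorted(product))


if __name__ == "__main__":
    for word in ("12345", "1234"):
        for effect in ("1", "12"):
            print(f"I = {word}: {effect} is confounded with {alias(effect, word)}")
```

Under I = 12345 every two-way interaction is aliased with a three-way interaction, which is why the two-way terms can be kept once the higher-order terms are assumed to be zero.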

In the design of experiments, a preliminary factorization of the variables may be used to decide on an appropriate design. For example, if it is believed that the two-way interaction 12 is zero, but the three-way interaction 345 is non-zero, the resolution V design above may be used with a different initial maximal model in the backwards elimination algorithm. In a future paper we will discuss the relationship between factorization and experimental design in greater detail.

Factorizations of the discrete part. Given two discrete variables A and B each at two levels, suppose proportional allocation, that is, // factorizes. Then it is easy to show that the estimates of A and B main effects are independent (in a main effects only model). More general statements are true: These relate the factorization of // to independence statements about $\hat\beta$, the regression coefficients, since var $\hat\beta$ = $(X^{T}X)^{-1}$, where X is the matrix of dummy variables.
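A small numerical check can illustrate the claim. The sketch below assumes that proportional allocation means the cell counts of the A by B table factorize into row and column totals, and that both factors are coded -1/+1 alongside an intercept. In that setting the off-diagonal element of $(X^{T}X)^{-1}$ linking the two main effects is zero exactly when the centered cross-product of the two coded columns is zero, which is what the code computes. The function name and the example counts are hypothetical.

```python
from fractions import Fraction


def centered_cross_product(counts):
    """counts[i][j] = number of runs with A at level i and B at level j.

    Levels are coded -1 and +1. The result is the cross-product of the two
    coded columns after removing their means (i.e. adjusting for the
    intercept); it is zero when the main-effect estimates are uncorrelated.
    """
    codes = (-1, 1)
    cells = [(i, j) for i in range(2) for j in range(2)]
    n = sum(counts[i][j] for i, j in cells)
    mean_a = Fraction(sum(codes[i] * counts[i][j] for i, j in cells), n)
    mean_b = Fraction(sum(codes[j] * counts[i][j] for i, j in cells), n)
    return sum(counts[i][j] * (codes[i] - mean_a) * (codes[j] - mean_b)
               for i, j in cells)


# Counts equal to (row total)(column total)/n: proportional allocation.
proportional = [[2, 4], [3, 6]]
# Counts that do not factorize.
non_proportional = [[2, 4], [3, 1]]

if __name__ == "__main__":
    print(centered_cross_product(proportional))      # zero cross-product
    print(centered_cross_product(non_proportional))  # non-zero cross-product
```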

4  Hypergraph  Factorizations 

Factorizations in graphical models. Graphical models are usually defined in terms of conditional independence, and are represented using either directed or undirected graphs (see for example Whittaker, 1990, or Pearl, 1988). However, any conditional independence statement is equivalent to a factorization of the overall distribution (or one of its margins) into two factors. This equivalence can be exploited to give a new graphical representation of conditional independence statements and the rules that govern them. Such representations are based on hypergraphs instead of graphs, which gives them several advantages: (1) Hypergraphs are mathematically simpler than the ternary conditional independence relation. (2) It is natural to consider factorizations involving more than two factors, but conditional independence does not allow for such a generalization. (3) Factors can be identified with independent but overlapping subsystems where the variables outside the overlap are independent, offering a convenient modeling paradigm.

We  argue  that  the  concept  of  factorization  forms 
a  more  general  and  more  convenient  mathematical  foun¬ 
dation  for  the  theory  of  graphical  models  than  does  con¬ 
ditional  independence.  This  point  of  view  has  been  pur¬ 
sued  in  Thoma  (1989)  in  a  different  context  and  will  be 
the  subject  of  a  forthcoming  paper  by  the  authors. 

Below we will focus on two important aspects of this idea. First we will show how conditional independence relations, respectively their equivalent factorizations, can be represented graphically. Secondly we will focus on the description of arbitrary factorizations through sets of conditional independence statements. This is the content of the Gibbs-Markov Equivalence, a fundamental result in the theory of graphical models. The equivalence holds only for strictly positive distributions. Using ideas from the theory of relational databases it is possible to extend the equivalence, in a weaker form, to arbitrary distributions.

Conditional Independence. Consider the set $U = \{V_1, \ldots, V_n\}$ of random variables. To avoid difficulties with regularity of the underlying measure, and thus to focus on the hypergraph representation, we assume that all variables have finite outcome spaces. However, all properties discussed below can be extended to very general distributions, including hierarchical interaction models. Let X, Y, and Z be three disjoint subsets of variables in U. Set X is independent of Y given set Z, written as $X \perp\!\!\!\perp Y \mid Z$, if $f_{XY|Z} = f_{X|Z} \cdot f_{Y|Z}$. If X and Y are conditionally independent given Z, then there exist two functions $g_{XZ}$ and $h_{YZ}$, such that $f_{XYZ} = g_{XZ} \cdot h_{YZ}$. Here $g_{XZ}$ and $h_{YZ}$ are functions that depend only on some of the variables, those in $X \cup Z$ in the case of $g_{XZ}$, and those in $Y \cup Z$ for $h_{YZ}$. We will say that these functions are carried by their respective sets of variables. Note that $g_{XZ}$ and $h_{YZ}$ are usually not margins of $f_{XYZ}$.
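The link between a two-factor factorization and conditional independence can be checked numerically on a small finite outcome space. The sketch below is illustrative only: the factor tables `g` and `h` are arbitrary hypothetical choices on binary variables, and exact rational arithmetic is used so that the equality test is not affected by rounding. It builds a joint distribution proportional to $g_{XZ} \cdot h_{YZ}$ and verifies $f_{XYZ} \cdot f_Z = f_{XZ} \cdot f_{YZ}$, which is the definition of $X \perp\!\!\!\perp Y \mid Z$ with the conditioning cleared of denominators.

```python
from fractions import Fraction
from itertools import product

LEVELS = (0, 1)


def normalize(raw):
    """Scale a table of non-negative values so that it sums to one."""
    total = sum(raw.values())
    return {key: value / total for key, value in raw.items()}


def is_cond_independent(f):
    """Check f(x,y,z) * f(z) == f(x,z) * f(y,z) for every outcome (x, y, z)."""
    fz = {z: sum(p for (_, _, zz), p in f.items() if zz == z) for z in LEVELS}
    fxz = {(x, z): sum(f[x, y, z] for y in LEVELS) for x, z in product(LEVELS, repeat=2)}
    fyz = {(y, z): sum(f[x, y, z] for x in LEVELS) for y, z in product(LEVELS, repeat=2)}
    return all(f[x, y, z] * fz[z] == fxz[x, z] * fyz[y, z] for x, y, z in f)


# Hypothetical factors: g is carried by {X, Z}, h is carried by {Y, Z}.
g = {(x, z): Fraction(1 + x + 2 * z) for x, z in product(LEVELS, repeat=2)}
h = {(y, z): Fraction(1 + y + 3 * y * z) for y, z in product(LEVELS, repeat=2)}

factored = normalize({(x, y, z): g[x, z] * h[y, z]
                      for x, y, z in product(LEVELS, repeat=3)})
# A joint with a direct X-Y interaction, which cannot be written as g * h.
interacting = normalize({(x, y, z): Fraction(1 + x * y)
                         for x, y, z in product(LEVELS, repeat=3)})

if __name__ == "__main__":
    print(is_cond_independent(factored))
    print(is_cond_independent(interacting))
```

In the first table neither `g` nor `h` is a margin of the joint distribution, which matches the remark that the factors are usually not margins.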

There are a number of well known rules that govern conditional independence. See for example Whittaker (1990) or Pearl (1988). We give four axioms, which we call the coarsening, projection, substitution, and intersection axioms respectively. Let X, Y, Z, W be four disjoint subsets of variables in U. Writing XY for $X \cup Y$, the axioms are

1. $X \perp\!\!\!\perp YW \mid Z \Rightarrow X \perp\!\!\!\perp Y \mid WZ$

2. $X \perp\!\!\!\perp YW \mid Z \Rightarrow X \perp\!\!\!\perp W \mid Z$

3. $X \perp\!\!\!\perp Y \mid WZ$ and $X \perp\!\!\!\perp W \mid Z \Rightarrow X \perp\!\!\!\perp YW \mid Z$

4. $X \perp\!\!\!\perp Y \mid WZ$ and $X \perp\!\!\!\perp W \mid YZ \Rightarrow X \perp\!\!\!\perp YW \mid Z$

The last axiom holds only if the joint distribution $f_{XYWZ}$ is strictly positive. For completeness, two additional axioms must be added to the set of four (Pearl 1988). These are the symmetry axiom, $X \perp\!\!\!\perp Y \mid Z \Rightarrow Y \perp\!\!\!\perp X \mid Z$, and the trivial independence axiom, $X \perp\!\!\!\perp \emptyset \mid Z$. Notice also that Axioms 1 and 2 provide the converse to Axiom 3, and Axiom 1 the converse to Axiom 4.

Graphical Representation. The conditional independence statement $X \perp\!\!\!\perp Y \mid Z$ can be represented graphically via its equivalent factorization as in Figure 3. If the two factors together cover all the variables under consideration (the set U, left side of Figure 3) the factorization is full. If they cover only a subset (right side of Figure 3), the factorization is embedded, since this implies that only a margin of the overall distribution factors.

Figure 3: $X \perp\!\!\!\perp Y \mid Z$

It is possible that $f_U = f_A \cdot f_B$, where A and B are two subsets of U, but A and B do not cover U. In this case the variables not in $A \cup B$ have no influence on $f_U$. This leads to a small problem with our graphical representation, since we can no longer tell whether a factorization is full or embedded by looking at the set of variables covered. Thus, we distinguish the two cases by using a different color or line style to represent embedded factorizations, as shown in Figure 3.

The conditional independence $X \perp\!\!\!\perp Y \mid Z$ is equivalently described through the two sets XZ and YZ, which indicate the factors of the corresponding factorization. The set {XZ, YZ} is called a scheme. The term is borrowed from the theory of relational databases. A scheme is equivalent to a reduced hypergraph with two (or more) hyperedges. We will use bold capitals, A, B, ..., to designate schemes.

No conditional independence relation will result in a scheme where one component is a subset of the other. The two components of the scheme are always incomparable. However, we can consider factorizations where one factor is carried by a subset of the other factor. For example $f_{XYZ} = f_Z \cdot f_{XY|Z}$. Since it is always possible to factor in this fashion, from the factorization point of view we need only consider maximal factors, and therefore the reduced hypergraph. However, additional factors may aid in interpretation, for example main effects in ANOVA with interactions present.

Graphical Representation of Axioms. Using factorizations and their schemes we can represent the axioms given above in graphical form, in the following four figures. The following terminology is convenient: The key of scheme $\mathbf{A} = \{A_1, A_2\}$ is the set $A_1 \cap A_2$, and sets $A_1 \setminus A_2$ and $A_2 \setminus A_1$ are the wings of the scheme.

Axiom  1  says  that  from  a  given  factorization  we 
can  derive  a  new  one  by  moving  wing  elements  to  the 
key.  This  simply  adds  variables  to  the  factors  that  do  not 
influence  the  distribution. 


Axiom  3  shows  that  we  can  replace  one  factor 
with  a  factorization  that  covers  the  same  variables.  One 
of  the  resulting  three  factors  can  then  be  absorbed  and 
we  end  up  with  a  two-component  scheme  again. 


Figure  6:  Axiom  3,  Substitution 

To formulate Axiom 4 we introduce the following definitions: If R is an arbitrary set of subsets of U then $R^{\circ}$ is its reduction, i.e. the set of maximal elements of R (a component is maximal if it is not strictly contained in another component). The meet of two schemes A and B is the set $\mathbf{A} \wedge \mathbf{B} := \{A \cap B \mid A \in \mathbf{A}, B \in \mathbf{B}\}^{\circ}$, i.e. the reduction of all intersections of components of A and B. Axiom 4 says that if the distribution is strictly positive, we can infer from two given factorizations a new one, the meet of the given schemes. The two schemes must share a component to ensure that the meet comprises only two components.
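The reduction and the meet are simple set operations, and a short sketch may make the definitions concrete. The representation chosen here (components as frozensets of single-letter variable names) and the function names are assumptions made for illustration. The example takes the two schemes that correspond to the premises of Axiom 4, {XWZ, YWZ} and {XYZ, WYZ}, and computes their meet.

```python
def reduction(sets):
    """Keep only the maximal elements: drop any set strictly contained in another."""
    unique = set(map(frozenset, sets))
    return {s for s in unique if not any(s < t for t in unique)}


def meet(scheme_a, scheme_b):
    """Reduction of all pairwise intersections of the components of two schemes."""
    return reduction(a & b
                     for a in map(frozenset, scheme_a)
                     for b in map(frozenset, scheme_b))


def show(scheme):
    """Render a scheme as a sorted list of strings for printing."""
    return sorted("".join(sorted(component)) for component in scheme)


if __name__ == "__main__":
    first = ["XWZ", "YWZ"]    # scheme of the first premise of Axiom 4
    second = ["XYZ", "WYZ"]   # scheme of the second premise of Axiom 4
    print(show(meet(first, second)))
```

The two input schemes share the component YWZ, and the meet has the two components XZ and YWZ, which is the scheme of the conclusion of Axiom 4.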


Figure 7: Axiom 4, Intersection

Axiom 2 says that we can derive a new factorization by clipping elements from the wings of a given one. However, the new scheme will cover fewer variables. There are simple examples showing that we do not derive valid new schemes if we clip elements of the key.

Figure 5: Axiom 2, Projection

5 General Factorizations


Gibbs-Markov Equivalence. We now consider factorizations that involve more than two factors, and, correspondingly, schemes with more than two components. To distinguish the general factorization and schemes from those involving two factors, we will use the terms 2-factorization and 2-scheme for the latter. Our overall strategy is described in the Introduction.

The Gibbs-Markov Equivalence, one of the central results for graphical models, says that for strictly positive distributions a set of conditional independence statements (2-factorizations) is equivalent to a factorization involving more than two factors.

Consider the following example. Let $f = g \cdot h \cdot k$ be defined over the set of variables U = {A, B, C, D, E}. Let g be carried by margin {A, B}, h by {B, C, D}, and k by {D, E}. The distribution f factors into three components, but it is easy to derive the following three 2-factorizations simply by multiplying two of the factors:

f = (gh)k
f = g(hk)
f = (gk)h

Figure 8 shows the corresponding 2-schemes.


Figure  8:  Derived  2-Schemes 

In  this  particular  situation  we  can  reconstruct 
the  original  factorization  from  the  three  2-factorizations 
as  follows:  First  clip  the  element  E  from  the  wing  of  the 
second  2-scheme,  then  use  the  resulting  scheme  to  replace 
the  larger  component  of  the  first  2-scheme.  The  result  is 
the  original  factorization.  In  fact,  the  third  2-scheme  is 
superfluous.  Note  that  this  reconstruction  is  possible  even 
if  the  distribution  is  not  strictly  positive. 

It  is  not  always  possible  to  proceed  as  in  the  ex¬ 
ample.  Figure  9  shows  an  example  where  it  is  not  possible 
to  derive  any  2-factorizations. 


Figure  9:  Simple  Cyclic  Scheme 

We are therefore faced with the following questions: (1) Which factorizations can be replaced by a set of 2-factorizations? (2) Which sets of 2-factorizations define a factorization? In addition, we need to know how to determine the set of 2-schemes that is equivalent to a given factorization, and how to determine the scheme of a factorization from a given set of 2-schemes.

Results.  The  results  differ  depending  on  whether  the 
overall  distribution  is  strictly  positive  or  not. 

If the distribution is strictly positive, i.e. $f_U > 0$, then any set of 2-factorizations (all involving the same set of variables) combine to give a factorization with at least two factors. Its scheme is the meet of all 2-schemes of the given 2-factorizations. Furthermore, for any factorization with a conformal scheme¹ there is an equivalent set of 2-factorizations. The corresponding 2-schemes can be determined as follows: Divide the components of the n-scheme into two groups, and determine for each group the union of its members. The two resulting sets form a 2-scheme. Each possible way of forming two groups will determine a 2-scheme. Some groupings may not yield a viable 2-scheme, and some groupings may yield the same 2-scheme, but overall they will determine a set of 2-schemes whose meet coincides with the scheme of the original factorization.
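The grouping procedure just described can be sketched in a few lines. The code below is an illustration under stated assumptions: components are frozensets of single-letter variable names, a grouping is treated as "not viable" when one of the two unions contains the other (this reading of viability is an assumption, not a definition from the paper), and all function names are hypothetical. It is applied to the scheme {AB, BCD, DE} of the earlier example.

```python
from functools import reduce
from itertools import combinations


def reduction(sets):
    """Keep only the maximal elements of a collection of sets."""
    unique = set(map(frozenset, sets))
    return {s for s in unique if not any(s < t for t in unique)}


def meet(scheme_a, scheme_b):
    """Reduction of all pairwise intersections of components."""
    return reduction(a & b
                     for a in map(frozenset, scheme_a)
                     for b in map(frozenset, scheme_b))


def two_schemes(scheme):
    """All distinct 2-schemes obtained by splitting the components into two
    non-empty groups and taking the union of each group."""
    comps = [frozenset(c) for c in scheme]
    indices = range(len(comps))
    found = set()
    for size in range(1, len(comps)):
        for group in combinations(indices, size):
            left = frozenset().union(*(comps[i] for i in group))
            right = frozenset().union(*(comps[i] for i in indices if i not in group))
            # Assumed viability rule: the two unions must be incomparable.
            if left <= right or right <= left:
                continue
            found.add(frozenset({left, right}))
    return found


if __name__ == "__main__":
    original = ["AB", "BCD", "DE"]
    derived = two_schemes(original)
    for scheme in derived:
        print(sorted("".join(sorted(c)) for c in scheme))
    recovered = reduce(meet, derived)
    print(sorted("".join(sorted(c)) for c in recovered))
```

For this scheme the six possible groupings collapse to three distinct 2-schemes, and folding `meet` over them returns the original three components.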

If the distribution is not strictly positive, i.e. $f_U \ge 0$, then any conflict-free² set of 2-factorizations (all involving the same set of variables) can be combined into one overall factorization. Its scheme is the meet of the given 2-schemes, and it is acyclic³. Furthermore, for any acyclic factorization there is an equivalent set of 2-factorizations. The 2-factorizations can be found using the same method as in the strictly positive case.

Distributions  that  are  not  strictly  positive  have 
a  support  (the  set  of  arguments  for  which  the  distribu¬ 
tion  has  non-zero  probability)  that  does  not  cover  the 
entire  outcome  space.  Such  a  distribution  will  not  factor 
unless  its  support  factors  too.  It  is  therefore  not  sur¬ 
prising  that  the  factorization  properties  of  arbitrary  dis¬ 
tributions  are  closely  related  to  those  of  sets.  The  set 
case  has  been  studied  extensively  in  the  theory  of  rela¬ 
tional  databases,  and  both,  terminology  and  results,  can 
readily  be  extended  from  the  set  to  the  distribution  case. 
The support of a strictly positive distribution is the entire outcome space, and these distributions are therefore not subject to the restrictions that apply to sets.

¹ A scheme is conformal if its components are equal to the cliques of a graph over the same set of variables, or equivalently, if the scheme is the meet of a set of 2-schemes (Thoma 1989).

² For a definition of conflict-free sets of 2-schemes we refer the reader to the influential paper by Beeri et al. (1983) and to the forthcoming paper by the authors, which will give a more detailed discussion of the issues involved.

³ A scheme is acyclic if there is a triangulated graph over the same set of variables, such that the cliques of the graph coincide with the scheme components.

1 Objectives

TESTGRAF is a program designed to graph the performance of examination questions in a way meaningful to statistically naive examiners. It was developed with the college or university instructor in mind who has given a multiple choice exam to a class of a hundred or more students, and who wants to evaluate test items with a view to

• deciding whether to use or reject an item in determining the final grade,

• getting information that will help in the rewriting of items for future use,

• identifying items which might be added to a pool for constructing subsequent exams,

• determining aspects of student performance on the test as a whole.

The program also generates examinee ability estimates which are optimal in the sense that they use the substantial information provided by which wrong options were chosen for incorrectly answered questions. The ability estimates are also optimal in a statistical sense (maximum likelihood conditional on item characteristics), and thus automatically weight test items by their efficiencies. These ability estimates are therefore substantially more efficient than the traditional percentage correct estimates.

TESTGRAF also has a module aimed at showing examinees how much information is provided by the exam about their ability or proficiency in the subject being tested.

2 Characteristic Curves


The central concept in the modern statistical theory of tests is the item or option characteristic function, shown in Figure 1. Ability is viewed as a latent variable which indexes the probability that a specific answer or option will be chosen among those presented for a given test item. The function $P_{im}(\theta)$ plotted in Figure 1 for each of the five options for Item 1 is the probability that option m will be chosen for item i by examinees at or near ability level $\theta$.

In Figure 1 the solid line indicates the probability that the option is chosen that is designated by the examiner as correct, and as one might hope, it shows that examinees with low ability have a small probability of getting the item correct, but that this probability increases rapidly over ability values 55 to 70, after which the probability of choosing the correct answer is very high. The dashed curves show the corresponding probabilities that the various wrong options will be chosen, and we observe that option 2 is especially popular with the weakest examinees, while option 3 tends to attract those with high ability and hardly anyone chooses option 5. The 5%, 25%, 50%, 75%, and 95% quantiles of the actual distribution of percentage correct (the traditional and usual scoring scheme) are indicated by the vertical dashed lines. Vertical bars on the correct answer curve show 95% pointwise confidence limits for this function.

Analyzing Examination Data

Figure 1: Option Characteristic Curves (panel title: Item 1)

3 Ability θ

It should be appreciated at the outset that the latent variable $\theta$ designed to capture unidimensional variation among examinees in knowledge, proficiency, or ability is not an independent variable, but rather an index for a family of Bernoulli probability distributions. As such it is only defined to within an arbitrary order-preserving transformation g, since if $\xi = g(\theta)$ then defining $P^* = P \circ g^{-1}$ implies $P^*(\xi) = P(g^{-1}(\xi)) = P(\theta)$. This means that the essential task is to estimate the rank of examinee a, a = 1, ..., N, after which the ability values $\theta_a$ can be assigned by any convenient order-preserving transformation of the N ranks.

Consequently,  ability  values  are  assigned  as 
follows: 

Step 1: Use some statistic $T_a$ to order examinees. By default TESTGRAF uses the conventional proportion correct to do this, but TESTGRAF also permits the user to input any set of values, including the result of some other type of scoring of the exam, results from other exams, or ability estimates from a previous TESTGRAF analysis.

Step 2: Assign the quantiles of the standard Gaussian distribution $\theta_a = z_a$ to the ordered examinees. Since most examination administrations tend to produce approximately Gaussian exam scores, this permits the ability values to roughly reflect the statistical properties of familiar exam scores.
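Steps 1 and 2 amount to ranking the examinees and mapping ranks to standard normal quantiles. The sketch below shows one way this might be done; it is not TESTGRAF's code. The plotting position rank/(N + 1) and the handling of ties (broken by input order) are assumptions, since the text does not specify either, and the function name is hypothetical.

```python
from statistics import NormalDist


def gaussian_ability_scores(statistic):
    """Rank examinees by `statistic` and assign standard normal quantiles.

    Returns a list theta with theta[a] the ability value for examinee a.
    """
    n = len(statistic)
    # Step 1: order examinees by the ranking statistic (ties keep input order).
    order = sorted(range(n), key=lambda a: statistic[a])
    theta = [0.0] * n
    standard_normal = NormalDist()
    for rank, a in enumerate(order, start=1):
        # Step 2: assumed plotting position rank / (n + 1), strictly inside (0, 1).
        theta[a] = standard_normal.inv_cdf(rank / (n + 1))
    return theta


if __name__ == "__main__":
    proportion_correct = [0.50, 0.90, 0.10, 0.75, 0.60]
    for score, value in zip(proportion_correct, gaussian_ability_scores(proportion_correct)):
        print(f"{score:.2f} -> {value:+.3f}")
```

Because only the ordering of the statistic matters, any monotone rescoring of the exam would yield the same ability values.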


4 Estimation of $P_{im}(\theta)$

The option characteristic function is estimated by kernel smoothing of the bivariate relationship between ability $\theta_a$ and the binary variable $y_{ima}$ taking the value 1 if examinee a of ability $\theta_a$ chose option m for item i, and zero otherwise. Kernel smoothing with Gaussian Nadaraya-Watson weights is employed, so that

Step 3:

$$\hat{P}_{im}(\theta) = \frac{\sum_a w_a(\theta)\, y_{ima}}{\sum_a w_a(\theta)}, \qquad \text{where} \qquad w_a(\theta) = \exp\{-[(\theta_a - \theta)/h]^2/2\}.$$

Since the number of examinees N may number in the thousands, the Fast Fourier Transform (Härdle, 1987) is used to keep the number of calculations to $O(N + M \log M)$, where M is the number of equally spaced values of $\theta$ at which the functions are to be evaluated.
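The smoothing in Step 3 can be written directly as a weighted average. The sketch below evaluates the Nadaraya-Watson estimate by a plain double loop, which costs on the order of N times M operations; it does not reproduce the Fast Fourier Transform speed-up mentioned above. The function name, argument names, and the bandwidth value in the example are illustrative assumptions.

```python
import math


def option_curve(theta, chose_option, grid, bandwidth):
    """Nadaraya-Watson estimate of an option characteristic curve.

    theta[a]        ability value of examinee a
    chose_option[a] 1 if examinee a chose the option, 0 otherwise
    grid            ability values at which to evaluate the curve
    bandwidth       Gaussian kernel bandwidth
    """
    curve = []
    for point in grid:
        weights = [math.exp(-0.5 * ((t - point) / bandwidth) ** 2) for t in theta]
        numerator = sum(w * y for w, y in zip(weights, chose_option))
        curve.append(numerator / sum(weights))
    return curve


if __name__ == "__main__":
    abilities = [-1.5, -0.5, 0.5, 1.5]
    chose = [0, 0, 1, 1]
    print(option_curve(abilities, chose, grid=[-1.0, 0.0, 1.0], bandwidth=0.5))
```

Each estimate is a weighted mean of zeros and ones, so the curve always lies between 0 and 1.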

Extensive experience indicates that a single default value of the bandwidth parameter h may be used in general, although the user can override this default. However, since a constant bandwidth tends to be somewhat inefficient when the independent variable is not equally spaced, and since Gaussian quantiles become sparse in the tails, this first smoothing step tends to produce rather variable curve values for $|\theta| > 2$. Consequently, a second smoothing step is then used:

Step 4:

The estimated function values $\hat{P}_{im}(\theta)$ are now smoothed over the M equally spaced values of $\theta$ using the variable bandwidth

$$h^* = h \exp(\theta^2/8)/2.0.$$

Finally, most instructors are familiar with percentage correct as an indicator of ability rather than the admittedly artificial standard Gaussian values. Consequently, indicating the curve for the correct option by $P_{ic}$, the transformation $\eta(\theta) = \sum_i P_{ic}(\theta)$, which is nearly certain to be strictly monotonic, in effect transforms Gaussian abilities into the expected number of correct items, and, when reexpressed as a percentage, tends to be more intuitive for most instructors.


5 Credibility Curves for θ

TESTGRAF can also plot the posterior density function for ability $\theta$ for selected examinees, conditional on the estimated option characteristic curves. For clarity of plotting, these curves are normalized to have a maximum of unity, and are referred to as relative credibility curves. Figure 2 shows an example. For examinee 2 taking a 100-item test, we see that the most likely ability value is 61%, even though the observed percent correct is only 56% (indicated by the vertical dashed line). The discrepancy is due to the fact that the maximum credibility curve estimate takes account of wrong option choices and of the efficiency of items answered correctly, and hence uses more information than simply counting correct answers. The curve also indicates, by the two dashed lines under the curve, that about 95% of the posterior probability falls between 56% and 68%.

Figure 2: Relative Credibility Curves (panel title: Examinee 2)



6 PCA Display

As a summary display TESTGRAF shows each correct option curve $P_{ic}(\theta)$ plotted at a position defined by the principal components scores for a principal components analysis of curve values. In this analysis, the M values of $\theta$ used to plot the curves play the role of the variables in a conventional analysis, while the cases or replications are the items. Curve values are weighted by the inverse of pointwise error variances in computing the cross-products matrix on which eigenanalysis is performed.

Figure 3: Principal Components of Correct Option Curves

Figure  3  shows  a  display  for  a  100-item  test. 
Here  we  see  that  the  very  difficult  items  an¬ 
swered  correctly  by  very  few  examinees  are 
clustered  at  the  lower  left,  while  the  extremely 
easy  items  are  found  at  the  lower  right.  Items 
with  flat  or  even  descending  correct  option 
curves  show  up  at  the  lower  edges  of  the  plot, 
while  steeply  increasing,  and  hence  highly  ef¬ 
ficient,  items  are  found  in  the  upper  regions. 


7  Other  Results 

TESTGRAF also can plot other useful functions. One of these is the test information function $I(\theta)$, defined as the expected Hessian with respect to ability $\theta$,

$$I(\theta) = \sum_i \sum_m \frac{[dP_{im}(\theta)/d\theta]^2}{P_{im}(\theta)}.$$

This function indicates the amount of information about $\theta$ provided by the test for each level of ability, and can be used to show the ability ranges to which the test tends to be "tuned".
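When the option curves are available only as values on an equally spaced grid, the information function can be approximated numerically. The following sketch uses central differences for the derivative, which is an assumption about how one might compute it and not a description of TESTGRAF's internals. The nested-list layout `curves[i][m][k]` and the function name are hypothetical.

```python
import math


def information_curve(curves, step):
    """Approximate I(theta) at the interior points of an equally spaced grid.

    curves[i][m][k] is the probability of option m of item i at grid point k,
    and `step` is the grid spacing.
    """
    n_grid = len(curves[0][0])
    info = []
    for k in range(1, n_grid - 1):
        total = 0.0
        for item in curves:
            for option in item:
                slope = (option[k + 1] - option[k - 1]) / (2 * step)  # central difference
                if option[k] > 0:  # options with zero probability contribute nothing
                    total += slope ** 2 / option[k]
        info.append(total)
    return info


if __name__ == "__main__":
    # One binary item with a logistic correct-option curve, used only as a check:
    # for this curve the information equals P(theta) * (1 - P(theta)).
    step = 0.01
    grid = [-1 + step * k for k in range(201)]
    correct = [1 / (1 + math.exp(-t)) for t in grid]
    wrong = [1 - p for p in correct]
    info = information_curve([[correct, wrong]], step)
    print(round(info[99], 4))  # grid point closest to theta = 0
```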

The program can also create a file containing the maximum likelihood estimates of ability for each examinee. These can be used to score the exam, and can also be input to TESTGRAF to provide a more efficient basis for ranking examinees.

Finally,  TESTGRAF  can  create  a  file  of 
commands  which  are  subsequently  processed 
by  another  program,  TESTLASR,  to  produce 
Postscript  commands  for  laser  printer  hard 
copies. 

The program and documentation are available from the author. A small fee is requested to cover the cost of reproduction and distribution. A more complete discussion of technical aspects of TESTGRAF can be found in Ramsay (1991).

1 Abstract

There  are  three  primary  reasons  to  transform  data:  lack 
of  symmetry,  nonconstant  variance,  and  interaction  be¬ 
tween  factors.  We  present  a  display  that  has  separate 
graphics  designed  to  diagnose  each  of  these  conditions. 
The  user  is  thus  free  to  weigh  the  importance  of  each  of 
these  three  criteria  for  the  problem  at  hand,  and  then  to 
choose  the  transformation  that  seems  most  suitable.  As 
is  the  practice  of  many  data  analysts,  this  system  uses 
only  a  few  select  transformations  rather  than  transfor¬ 
mations  to  arbitrary  powers. 

Although this system is demonstrated with data from designed experiments, it may also be used for regression problems.

KEYWORDS:  robustness,  symmetry,  running  scale 
estimation. 

2  Introduction 

Transformation  can  often  achieve  the  assumptions  im¬ 
plicit  in  a  regression  or  other  estimation  problem.  Such 
assumptions  include:  the  distribution  of  the  errors  is 
symmetric  (or  Gaussian),  and  the  variance  is  constant. 
At  times  a  transformation  can  also  produce  a  more  par¬ 
simonious  model. 

In  the  present  paper  we  use  the  power  transformations 
of  Box  and  Cox  (1964).  This  family  of  transformations, 
which  includes  the  logarithm,  embraces  those  most  com¬ 
monly  used.  We  also  use  robust  estimation  to  ensure 
that  the  results  are  not  unduly  swayed  by  a  few  outliers. 

For background on transformations, see chapters 4 and 8 (written by Emerson and Stoto) of Hoaglin, Mosteller and Tukey (1983). Also, the Box and Cox (1964) paper (and its discussion) contains many interesting comments. Robustness of transformations is discussed in Carroll (1980, 1982), and nonparametric transformations are explained in Hastie and Tibshirani (1990).

* Research supported by NSF grant ISI 88-61156

An  advantage  of  the  display  being  introduced  is  that  it 
shows  the  effect  of  transformation  on  each  of  the  criteria 
individually.  See  Sampson  and  Guttorp  (1991)  for  an 
example  in  which  it  is  desirable  to  attain  symmetry  and 
constant  variance  without  destroying  interaction. 

3  Symmetry 

Symmetry  is  diagnosed  graphically  by  producing  a  plot 
based  on  the  residuals  from  the  fit  for  a  particular  trans¬ 
formation.  As  is  done  in  the  other  plots,  the  residu¬ 
als  from  a  robust  fit  are  used  by  default,  but  the  least 
squares  residuals  may  optionally  be  used. 

Let $r_{(i)}$ be the ith order statistic of the residuals scaled by the (Gaussian-consistent) median absolute deviation, let n be the number of residuals and let M be the median of the scaled residuals. For each i between n/2 and 1, the quantity

$$(r_{(i)} + r_{(n-i)})/2 - M$$

is plotted versus the value of i. If the distribution is symmetric, this will tend to be a flat line at zero.
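The plotted quantities can be computed as in the following sketch. Two choices here are assumptions rather than statements from the paper: the Gaussian consistency constant 1.4826 for the median absolute deviation, and the pairing of the ith order statistic with the (n + 1 - i)th under 1-based indexing, so that the smallest residual is paired with the largest (the printed formula appears to write the index as n - i). The function name is hypothetical.

```python
from statistics import median


def symmetry_points(residuals):
    """Return (i, midsummary minus median) pairs for a symmetry plot."""
    center = median(residuals)
    # Gaussian-consistent median absolute deviation (constant 1.4826 assumed).
    mad = 1.4826 * median(abs(r - center) for r in residuals)
    scaled = sorted(r / mad for r in residuals)
    n = len(scaled)
    m = median(scaled)
    # 1-based order statistic i is scaled[i - 1]; its mirror n + 1 - i is scaled[n - i].
    return [(i, (scaled[i - 1] + scaled[n - i]) / 2 - m) for i in range(1, n // 2 + 1)]


if __name__ == "__main__":
    print(symmetry_points([-3, -1, 0, 1, 3]))   # symmetric sample: values at zero
    print(symmetry_points([0, 1, 2, 3, 10]))    # right-skewed sample: positive at i = 1
```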

Since the points in this plot are dependent, the symmetry plots typically show a curve even when samples come from a symmetric distribution. It thus becomes important to have a minimum range that the y-axis spans. A glance at the asymptotic distribution of the points in the plot (Stuart and Ord, 1987, p.452) and the inspection of plots for several sample sizes and distributions led to forcing $\pm 4/\sqrt{n}$ to appear in the plot (a dashed line is drawn at these two values). When several points fall outside the dashed lines and they form a definite curve, then asymmetry may be assumed.

The  plot  described  above  is  similar  to  plots  proposed 
by  several  people;  these  are  reviewed  in  Fisher  (1983). 
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4  Homoscedasticity 

To diagnose heteroscedasticity, we plot running scale
estimates of the residuals versus the fit. The running scale
estimate sorts the fitted values into ascending order. A
certain fraction of the data enter into the estimation at
each step (we have used one-half in the examples). The
location for a step is considered to be the mean of the
fitted values that are being used. Both the standard
deviation and a robust scale estimate of the residuals
(corresponding to the fitted values used) are computed
at each step. The robust estimate that is used is the
A-estimate of scale based on the bisquare that has an
efficiency of 80 percent at the Gaussian distribution; see
Burns and Martin (1991).
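
A minimal sketch of the running scale computation follows. It covers only the standard deviation; the bisquare A-estimate is omitted, and the function name `running_scale` and the choice to advance the window one observation at a time are assumptions made for the illustration.

```python
import statistics

def running_scale(fitted, residuals, fraction=0.5):
    """Running standard deviation of the residuals along the sorted fitted values."""
    pairs = sorted(zip(fitted, residuals))       # ascending order of the fit
    n = len(pairs)
    width = max(2, round(fraction * n))          # observations used at each step
    steps = []
    for start in range(0, n - width + 1):
        window = pairs[start:start + width]
        location = statistics.mean([f for f, _ in window])   # mean of the fitted values
        scale = statistics.stdev([r for _, r in window])     # spread of their residuals
        steps.append((location, scale))
    return steps

print(running_scale([1, 2, 3, 4], [0.1, -0.1, 1.0, -1.0]))
```

Plotting scale against location gives the running scale plot; a rising trend suggests variance that grows with the fit.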

The  test  of  the  null  hypothesis  that  the  residuals  are 
homoscedastic  is  the  Spearman  rank  correlation  test  of 
the  fit  versus  the  absolute  value  of  the  residuals.  This 
test  was  proposed  by  Horn  (1981). 
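
The statistic behind this test can be sketched as follows. The helper names are hypothetical, ties are given mean ranks, and only the correlation coefficient is computed; the significance calculation is left out.

```python
def mean_ranks(values):
    """Ranks 1..n, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of the ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_fit_vs_abs_resid(fitted, residuals):
    """Spearman rank correlation between the fit and the absolute residuals."""
    rx = mean_ranks(fitted)
    ry = mean_ranks([abs(r) for r in residuals])
    mx = sum(rx) / len(rx)
    my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

print(spearman_fit_vs_abs_resid([1, 2, 3, 4], [0.1, -0.2, 0.3, -0.4]))
```

A coefficient near zero is consistent with homoscedastic residuals; a large positive value indicates spread increasing with the fit.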

5  Parsimony 

In designed experiments it is possible to make plots of
the interaction of two factors; such plots were not chosen
for two reasons: simplicity and generality. Since there
can be a great number of pairs of factors (not to mention
triples and so on), the display of interactions is a
complicated task best suited to a specialized procedure.
Additionally, the general regression problem is not often
thought of in terms of interactions. By producing a
different plot, both designed experiments and general
regression problems can benefit from the same set of plots.

We selected a barchart that tells how well a simple
(user-specified) model does. For both the least squares
and the robust fit there is both a standard and a robust
estimate of the fraction of variability explained. The
standard method is the fraction of the sum of squares
explained by the model. The robust method uses a
τ-estimate of scale based on a Huber function with tuning
constant 1.7 (Burns and Martin, 1991). Let τ denote
this scale estimate with the median used as the location
estimate, and let y and r denote the response and the
residuals, respectively. Then the fraction of variability
explained is


6  The  Display 

The ingredients of the display are the four types of plot:
“residual versus fit”, heteroscedasticity plot, symmetry
plot and parsimony plot. The allowable transformations
are square, identity, square root, cube root, logarithm,
inverse square root and inverse. An implementation
of the display was made in S-PLUS.

All four plots are viewed for a single transformation, or
one  type  of  plot  is  viewed  for  up  to  four  transformations. 

In preparation for the display, the model is fit for each
transformation both with least squares and with a robust
technique. For the examples, the $L_1$ solution was used.
This has a high breakdown point for balanced designs,
but moderately low efficiency at the Gaussian model. A
different algorithm should be used for the general
regression problem since leverage becomes more of an issue. A
high-breakdown, high-efficiency algorithm is preferred.

The  user  may  also  choose  whether  to  use  the  robust 
residuals  (the  default)  or  the  least  squares  residuals. 


7  Example 

We  use  the  poison  data  discussed  in  Box  and  Cox  (1964), 
and  in  many  subsequent  papers  on  transformation.  This 
dataset  consists  of  4  observations  on  each  combination 
of  3  poisons  and  4  treatments.  The  parsimonious  model 
that  is  used  is  the  additive  one  —  the  response  is  modeled 
as  poisons  plus  treatments. 

Figure 1 shows the display for the response in the
original units. There is clearly non-constant variance, and
unevenness of the bars in the parsimony plot indicates
that there is a problem with non-Gaussian errors. Both
the plot for symmetry and the “residual versus fit” plot
indicate that there is not symmetry. When the least
squares residuals are used, there is slightly less
indication that a transformation is needed; the symmetry plot
is especially degraded.

Figure  2,  using  the  inverse  of  the  response,  is  close 
to  the  ideal.  The  fraction  of  variability  explained  is 
much  higher  and  virtually  the  same  on  all  four  bars. 
The  symmetry  plot  is  bent  down  slightly,  indicating  that 
the  inverse  transform  could  be  too  strong.  The  running 
scale still has some tendency toward a positive slope, which
would  indicate  a  transformation  that  is  not  quite  strong 
enough. 

Symmetry  plots  for  four  transformations  are  shown  in 
figure  3.  Only  the  plot  for  the  identity  transform  shows 
a  definite  trend  —  the  other  three  plots  are  indicating  no 
or  very  slight  asymmetry.  The  inverse  square  root  seems 
to  be  close  to  the  optimal  transform  for  symmetry. 


Figure 1: Poison Data, Original Scale


8  Discussion 

Transformation is a common data analysis task. With
the graphical display introduced in this paper a data
analyst can quickly decide on an appropriate transformation
or see that transformation will have little effect
on the analysis.

The types of plots presented may also be used individually
to explore data even when transformation is not
being considered. In particular, the plot for symmetry
presented here is more usable than those previously
proposed because of the additional lines that indicate the
significance of a curve in the plot.

Figure 2: Poison Data, Inverse Scale

Figure 3: Symmetry Plots for the Poison Data

Graphical methods are developed for frequency
distributions of fully ranked data with pseudoranks.
The proposed graphical techniques use permutation
polytopes, and are compatible with both Spearman’s
ρ and Kendall’s τ. The problem of visualization in
higher dimensions is also addressed.

1.  INTRODUCTION 

Graphical methods are critically needed to
display frequency distributions for fully ranked data.
Fully ranked data occur, for example, when judges are
asked to rank n items, possibly with pseudoranks, in
order of preference. Each observation is a
permutation of the n distinct pseudoranks, and the
resulting set of frequencies is a function on $S_n$, the
symmetric group of n elements. Because $S_n$ does not
have a natural linear ordering, graphical methods such
as histograms and bar graphs cannot be used to
display frequency distributions for ranked data.
Other existing graphical methods for rankings include
multidimensional scaling, minimal spanning trees, and
nearest neighbor graphs as discussed by Diaconis
(1988). Cohen and Mallows (1980) propose graphical
methods based on multi-dimensional scaling and
biplots. Cohen (1990) presents alternate exploratory
data techniques for ranked data.

In this paper, graphical methods are developed
to display frequency distributions of fully ranked data
by using permutation polytopes. A polytope is the
convex hull of a finite set of points in $R^n$, and a
permutation polytope is the convex hull of the n!
points in $R^n$ whose coordinates are the permutations
of n pseudoranks. To represent a set of ranked data,
the frequencies with which the permutations are chosen
are displayed, not on a line as is done with
histograms, but on the vertices of the permutation
polytope. The resulting graphical displays are
especially useful as diagnostic tools because they are
compatible with two commonly used metrics on $S_n$:
Kendall’s τ and Spearman’s ρ. Both τ and ρ are
easily interpreted on the permutation polytope.

The permutation polytope on which the n!
frequencies are displayed is inscribed in a sphere in an
(n − 1)-dimensional subspace of $R^n$, as noted by
McCullagh (1990) for ordinary ranks, in such a way
as to be compatible with both Kendall’s τ and
Spearman’s ρ. Hence, for n>4, the problem of
visualization of points on a polytope in higher
dimensions must be addressed. One approach to this
problem is to explore a higher dimensional polytope
by examining its three and four dimensional faces.
By defining a permutation polytope as the solution to
a finite set of linear inequalities, all of the faces can be
characterized. In particular, it is shown that all two-
dimensional faces are combinatorially equivalent to
either squares or hexagons, and all three dimensional
faces are combinatorially equivalent to either truncated
octahedrons, cubes, or hexagonal prisms.


2.  PERMUTATION  POLYTOPES  FOR  n=3,4 

Before developing the concepts needed for the
proposed graphics for either n>4 or for pseudoranks,
we illustrate the proposed technique with ordinary
ranks for n=3 and n=4. Ranked data can be recorded
either as an ordering or as a ranking. Items are
labeled with letters, and orderings are denoted by
permutations of the first n letters, bracketed by < >.
For example, <b,c,a,d> means that item b is ranked
first, and item d last. A ranking is a permutation of
n values written as a row vector $\pi = (\pi_1, \dots, \pi_n)$
where $\pi_i$ is the rank of the ith item. The ranking
corresponding to <b,c,a,d> is (3,1,2,4).

Figure 1 shows the orderings and rankings of the
6 elements of $S_3$. Two adjacent points are connected
by an edge if their orderings differ by a pairwise
adjacent transposition, or equivalently, if their
rankings differ by the inversion of two consecutive
values. Hence, the minimum number of edges that
must be traversed to get from one vertex to another is
equal to Kendall’s τ. Formally, if π and σ are two
rankings, then τ(π,σ) is the number of pairs (i,j) such
that $\pi_i < \pi_j$ and $\sigma_i > \sigma_j$. This is equivalent to the
minimum number of pairwise adjacent transpositions
needed to change the ordering corresponding to π into
the ordering corresponding to σ. The placement of
the vertices in Figure 1 is also compatible with
Spearman’s ρ:

\[ \rho(\pi,\sigma) = \Bigl( \sum_{i=1}^{n} (\pi_i - \sigma_i)^2 \Bigr)^{1/2}. \]

If the edges of the regular hexagon are all of length
$\sqrt{2}$, then Spearman’s ρ is the Euclidean distance
between two vertices. Note that the two vertices on a
common edge have the same item ranked either first
or last.
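
The two metrics are easy to compute directly from the definitions above. The following is a small sketch; the function names are hypothetical, and orderings are assumed to be given as strings of item letters.

```python
import math
from itertools import combinations

def ordering_to_ranking(ordering):
    """Convert an ordering such as 'bcad' into the ranking (3, 1, 2, 4)."""
    items = sorted(ordering)                      # 'a', 'b', 'c', ...
    return tuple(ordering.index(item) + 1 for item in items)

def kendall_tau(pi, sigma):
    """Number of item pairs that the two rankings place in opposite order."""
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )

def spearman_rho(pi, sigma):
    """Euclidean distance between the two ranking vectors."""
    return math.sqrt(sum((p - s) ** 2 for p, s in zip(pi, sigma)))

pi = ordering_to_ranking("abc")
sigma = ordering_to_ranking("bac")
print(pi, sigma, kendall_tau(pi, sigma), spearman_rho(pi, sigma))
```

Two rankings joined by an edge of the hexagon, such as those for <a,b,c> and <b,a,c>, are at τ-distance 1 and ρ-distance $\sqrt{2}$.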

These ideas extend to n=4 by placing the 24
permutations on the vertices of a truncated
octahedron, as shown in Figure 2 [Yemelichev et al.
(1984)]. The truncated octahedron has 8 hexagonal
faces and 6 square faces. As in Figure 1, τ is the
minimum number of edges that must be traversed to
get from one vertex to another, and ρ is the Euclidean
distance between two vertices if each edge has length
$\sqrt{2}$ [cf. Schulman (1979)]. On the truncated
octahedron, the 4 vertices of a square have the same 2
items ranked in the first 2 positions and the other 2
items ranked in the last 2 positions. Similarly, the 6
vertices of a hexagon all have the same item ranked
either first or last. The idea that each face has a
“defining property” is fundamental in the proposed
graphical methods for n>4.

For  n  =  3,  consider  the  data  of  Duncan  and 
Brody  (1982)  in  which  1439  people  ranked  city, 
suburban,  and  rural  living  in  order  of  preference.  The 
current  residence  is  also  recorded  as  a  covariate.  For 
each  covariate,  the  relative  frequencies  of  each 
permutation  were  calculated.  In  Figure  3  these 
relative  frequencies  are  plotted  on  the  vertices  of  3 
hexagons.  Each  hexagon  corresponds  to  a  covariate, 
and  the  sizes  of  the  circles  at  the  vertices  indicate  the 
relative  values.  It  is  immediately  obvious  that  rural 
and  suburban  residents  are  similar  to  each  other,  but 
are  both  different  from  city  dwellers.  Those  who 
prefer  the  city  most  seem  to  live  in  the  city. 
Relatively  few  rural  and  suburban  dwellers  prefer 
their  current  location  least,  while  many  city  dwellers 
would  rather  be  anyplace  else.  For  n  =  3,  this 
proposed  graphical  technique  is  similar  to  the  graphics 
of  Cohen  and  Mallows  (1980)  in  which  circles  with 
areas  proportional  to  the  frequencies  are  placed  at  the 
ends  of  6  vectors  radiating  from  the  origin. 

The  plotting  of  ranked  data  with  n  =  4  on 
truncated  octahedrons  is  illustrated  by  the  following 
example.  At  the  start  of  a  literary  criticism  course, 
38 students read a short story and ranked 4
different  styles  of  literary  criticism  in  order  of 
preference.  At  the  end  of  the  course,  they  read 
another  short  story  and  again  ranked  the  same  four 
styles  of  literary  criticism.  The  4  styles  were 
authorial  (a),  comparative  (c),  personal  (p),  and 
textual  (t);  and  the  question  of  interest  was  whether 
or  not  the  post-course  rankings  had  moved  in  the 
direction of the teacher’s own preferred ordering
<p,c,a,t> [see Critchlow and Verducci (1989)]. The
frequencies of the 38 pre-course rankings are shown in
Figure 4a and the 38 post-course rankings are shown
in Figure 4b. Most obviously, the frequencies do
change a great deal between the two sets of rankings.
First, there is an increase in the frequencies at the 6
vertices that correspond to orderings that begin with
c. The post-course rankings do not seem to have
moved toward the teacher’s preferred ranking,
<p,c,a,t>, but as concluded by Critchlow and
Verducci (1989), they appear to be, overall, closer to
<p,c,a,t> than are the pre-course rankings. The
orderings seem to have moved toward <c,p,t,a>.
McCullagh and Ye (1990) illustrate a similar
conclusion by plotting the vectors of the average pre-
and post-course ranking on a truncated octahedron.
Other  observations  are  1)  the  frequencies  at  the  6 
vertices  corresponding  to  the  ordering  ending  in  (c) 
decrease;  2)  style  (a)  is  rarely  chosen  as  either  a  first 
or  second  choice  after  the  course  is  completed;  and  3) 
the  incidence  of  style  (t)  as  a  first  choice  decreases. 

To make the plots perceptually accurate, the
areas of the circles in Figures 3 and 4 are based on
Stevens’ Law, which says that the perceived scale, p, of
the size of an area is

\[ p \propto (\text{area})^{0.7} \]

(Cleveland, 1985). Hence, the areas of the circles are
calculated as

\[ \text{area} \propto f^{1/0.7} \]

where f is the frequency. If the areas are proportional
to the values, i.e., area ∝ f, then small circles appear
too large and large circles appear too small.
Conversely, if the radius of the circle is proportional
to the frequency, i.e., area ∝ f², then large values are
magnified and small values are minimized.
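
The sizing rule can be sketched as below. The function name `circle_radius` is hypothetical, and the default exponent of 0.7 is an assumption (a value often quoted for perceived area); it is passed as a parameter so that another value can be substituted.

```python
import math

def circle_radius(frequency, exponent=0.7, scale=1.0):
    """Radius of a circle whose perceived size is proportional to the frequency."""
    # Perceived size ~ area ** exponent, so area must grow as frequency ** (1 / exponent).
    area = scale * frequency ** (1.0 / exponent)
    return math.sqrt(area / math.pi)

for f in (0.05, 0.10, 0.20, 0.40):
    print(f, round(circle_radius(f), 4))
```

Setting `exponent=1.0` gives areas proportional to the frequencies, and `exponent=0.5` gives radii proportional to the frequencies, the two alternatives discussed above.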

3.  PERMUTATION  POLYTOPES  FOR  n  >  4 

Instead of using the integers from 1 to n, some
applications use pseudoranks in which a ranking is a
vector whose elements are a permutation of n distinct
values, and an ordering is a permutation of the n
items such that the ith item is assigned the ith
smallest pseudorank. Without loss of generality,
assume that the pseudoranks are $a_1 > a_2 > \dots > a_n > 0$.
The ordinary ranks are $a_i = n - i + 1$. To extend
Spearman’s ρ to pseudoranks, let
$a(\pi) = (a(\pi_1), a(\pi_2), \dots, a(\pi_n))$ and
$a(\sigma) = (a(\sigma_1), a(\sigma_2), \dots, a(\sigma_n))$
be two rankings where π and σ are elements of $S_n$.
Then

\[ \rho(a(\pi), a(\sigma)) = \Bigl( \sum_{i=1}^{n} \bigl(a(\pi_i) - a(\sigma_i)\bigr)^2 \Bigr)^{1/2}. \]

Next, as in Schulman (1979), consider the set of
vectors in $R^n$ whose elements are permutations of the
pseudoranks. These points lie in the intersection of
the sphere

\[ \sum_{i=1}^{n} (x_i - \bar{a})^2 = \sum_{i=1}^{n} (a_i - \bar{a})^2 \]

and the (n − 1)-dimensional hyperplane

\[ \sum_{i=1}^{n} x_i = n\bar{a}, \quad \text{where } \bar{a} = n^{-1} \sum_{i=1}^{n} a_i. \]

The permutation polytope is the convex hull of these
points, and it can be mapped into $R^{n-1}$ via the
Helmert transformation. Because it is an orthonormal
mapping, it preserves distance and angles, so that the
polytope is still inscribed in a sphere in $R^{n-1}$ and
Spearman’s ρ (which is the Euclidean distance
between two points) is preserved. When n=4 and
$a_i = n - i + 1$, calculations show that the resulting
polytope is a truncated octahedron whose vertices are
exactly the vectors of permutations and whose edges
are all of length $\sqrt{2}$.
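
The calculation for n=4 can be checked numerically. The sketch below assumes one common form of the Helmert contrasts (row k has k equal positive entries followed by a single negative entry); the helper names are hypothetical.

```python
import math
from itertools import combinations, permutations

def helmert_rows(n):
    """The n-1 orthonormal Helmert contrasts, each orthogonal to (1, ..., 1)."""
    rows = []
    for k in range(1, n):
        norm = math.sqrt(k * (k + 1))
        rows.append([1 / norm] * k + [-k / norm] + [0.0] * (n - k - 1))
    return rows

def project(x, rows):
    """Coordinates of x in the (n-1)-dimensional space spanned by the contrasts."""
    return tuple(sum(r * xi for r, xi in zip(row, x)) for row in rows)

rows = helmert_rows(4)
vertices = [project(p, rows) for p in permutations((1, 2, 3, 4))]
radii = {round(math.hypot(*v), 9) for v in vertices}
edges = [
    (u, v)
    for u, v in combinations(vertices, 2)
    if math.isclose(math.dist(u, v), math.sqrt(2))
]
print(len(vertices), radii, len(edges))
```

The 24 projected points share a single radius, and 36 pairs lie at distance $\sqrt{2}$, which matches the 36 edges of a truncated octahedron.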

For n > 4, the proposed graphical methods
require that we relate Kendall’s τ to the permutation
polytope, and that we characterize all the faces,
particularly in 3 dimensions. Answers to both
problems are found in Chapter 5 of Yemelichev et al.
(1984) in which a permutation polytope is shown to
be equivalent to the following system of constraints:

\[ \sum_{i \in \omega} x_i \le \sum_{i=1}^{|\omega|} a_i \quad \text{for all } \omega \subset \{1, 2, \dots, n\} \tag{1} \]

\[ \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} a_i. \tag{2} \]

For a given n, the faces are characterized by
Theorem 3.4 of Section 5 which proves that the set of
solutions to (1) and (2) is an i-dimensional face
(i-face), 0 ≤ i ≤ n−2, if and only if for each such solution
the inequalities in (1) are satisfied as equalities for
subsets $\omega_1, \omega_2, \dots, \omega_{n-i}$ of {1,2,...,n} such that

\[ \omega_1 \subset \omega_2 \subset \dots \subset \omega_{n-i-1} \subset \omega_{n-i} = \{1,2,\dots,n\}. \]

To use this theorem, first define $Q_k = \omega_k \setminus \omega_{k-1}$,
1 ≤ k ≤ n−i (with $\omega_0$ empty). Then, the 0-faces (vertices) are exactly
the n! points whose elements are permutations of the
pseudoranks because each $Q_k$ contains exactly one
element. Similarly, for a 1-face, one of the $Q_k$’s has 2
elements and the others each have exactly one.
Hence, Corollary 3.9 of Section 5 proves that 2
vertices of a permutation polytope are adjacent (on
the same 1-face) if and only if they differ by a single
transposition of $a_k$ and $a_{k+1}$, 1 ≤ k ≤ n−1. For
ordinary ranks, we now have that Kendall’s τ is equal
to the minimum number of edges (1-faces) that must
be traversed to get from one point to another. This
also extends Kendall’s τ to pseudoranks in an obvious
manner that warrants more study. Similarly,
Theorem 3.4 shows that every 2-face is either a
hexagon if all $Q_k$ have one element except one which
has 3 elements (so that all but 3 of the orderings are
fixed), or a square if all $Q_k$ have one element except
for 2 of them which each have 2 elements. The
3-faces correspond to truncated octahedrons if all $Q_k$
have one element except one which has 4 elements (so
that all but four of the orderings are fixed), to cubes
if all $Q_k$ have one element except 3 which have 2
elements each, and to hexagonal prisms if all $Q_k$’s
have one element except 2, one which has 2 and one
which has 3 elements.
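
The characterization above lends itself to a direct enumeration. The sketch below is an illustration only: it assumes that each i-face corresponds to one ordered partition of the n items into n − i nonempty blocks (the sets Q), and it classifies each face by the sizes of the blocks with more than one element. The names `FACE_NAMES` and `face_types` are hypothetical.

```python
from collections import Counter
from itertools import product

# Block sizes larger than one, sorted, determine the combinatorial type of the face.
FACE_NAMES = {
    (): "vertex",
    (2,): "edge",
    (3,): "hexagon",
    (2, 2): "square",
    (4,): "truncated octahedron",
    (2, 2, 2): "cube",
    (2, 3): "hexagonal prism",
}

def face_types(n, dim):
    """Count the dim-faces of the permutation polytope on n items by type."""
    blocks = n - dim                          # number of sets Q_1, ..., Q_(n-dim)
    counts = Counter()
    for labels in product(range(blocks), repeat=n):
        sizes = Counter(labels)               # size of each block under this labelling
        if len(sizes) < blocks:               # skip labellings that leave a block empty
            continue
        key = tuple(sorted(s for s in sizes.values() if s > 1))
        counts[FACE_NAMES.get(key, "other")] += 1
    return dict(counts)

print(face_types(4, 2))
print(face_types(5, 3))
```

For n=4 the enumeration recovers the 8 hexagons and 6 squares of the truncated octahedron; for n=5 the 3-faces are truncated octahedrons and hexagonal prisms, with cubes first appearing at n=6.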

Thus,  all  3-dimensional  faces  of  any  permutation 
polytope  can  be  characterized  and  the  data  can  be 
illustrated  by  a  sequence  of  3-dimensional  polytopes  in 
which  the  frequencies  are  plotted  on  the  appropriate 
vertices.  Frequently,  it  is  useful  to  also  plot  portions 
of the 4-dimensional polytopes.

Abstract

Quantile  plots  are  used  to  display  data  for  better 
understanding  and  comparison  of  distributions.  Splitting  the 
quantile  plot  by  a  categorical  variable  helps  one  visualize  an 
analysis of variance. Plots of rank-transformed data
correspond to non-parametric methods and can also aid in
the  analysis  of  categorical  data.  As  less  abstract  and  more 
direct  presentations  of  data  than,  for  example,  box  plots, 
quantile  plots  can  be  more  effective,  in  particular,  when 
presenting  to  non-statisticians. 

Definition 

This paper will present quantile plots as a method of
plotting actual data side by side in a way that is easily
presentable to anyone, regardless of their statistical training.
One simply plots the data points as an empirical quantile
function [Parzen, Cleveland] which is the plot of the value
of each observation on the vertical scale against the rank
within the sample on the horizontal scale. The idea is that
the random variable V is a function on the unit interval,
[0,1]. It is closely related to the empirical cumulative
distribution of V with the horizontal and vertical axes
flipped. One should be careful not to confuse the meaning
of "quantile plot" in this paper with the common "quantile-
quantile plot" or "q-q plot" which has a slightly different
definition and different use. Here, the former is a special
case of the latter, i.e., a q-q plot with uniform quantiles on
the horizontal axis. This paper also pertains only to the use
of quantile plots and does not involve quantile functions or
their estimation. ([Parzen] has more sophisticated uses for
quantile functions and related constructs.)
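
The coordinates of such a plot are simple to compute. In this sketch the function name is hypothetical and the plotting position (rank − 0.5)/N is an assumption; rank/N is another common choice.

```python
def quantile_plot_points(values):
    """Return (fraction, value) pairs: rank fraction horizontally, data value vertically."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed plotting position: (rank - 0.5) / n, which keeps every point inside (0, 1).
    return [((rank - 0.5) / n, v) for rank, v in enumerate(ordered, start=1)]

for fraction, value in quantile_plot_points([12, 7, 30, 18]):
    print(round(fraction, 3), value)
```

Each observation contributes exactly one point, so the amount of ink is proportional to the sample size.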

In Figure 1, "Score at Week 3", each observation is a
single diamond. When N gets to be very large, the points
tend to meld together, depending on the resolution of one's
graphics device. But with such large N, the empirical
distribution should be closer to the true distribution. Any
quantile of the distribution is readable from this graph, in
particular, the median, which is the quantile at .5. The
local density is the inverse of the local slope of the quantile
function so that ranges where the slope of the points is low
are regions of higher density. Extreme outliers and multiple
modes are often obvious to the eye. The use of points
makes the amount of ink used to print the points
proportional to the size of the sample, a desirable property.
Connecting the points with lines would confuse this
ink/observations ratio and also emphasize what could be an
incorrect interpolation.

One can easily add many features to represent various
summary statistics. The interpretation of the symbols in
these quantile plots is as follows:

•  The data points themselves are small, hollow
diamonds. Diamonds more precisely indicate position
than do squares or circles. Also, they are
two-dimensional, which crosshairs, X's, and asterisks are
not. The hollowness allows points to overlap
without much loss of ink area.

•  For reference, crosshairs are plotted at the quantiles of
.05, .25, .50, .75, and .95. They are slightly larger
than the diamonds, so as to show up in plots with
many points, but do not add any more two-
dimensional images to the picture.

•  Small dots are placed at the corners of the "box" of
the traditional box plot. (These dots may be too
small to show well in this printing.)

•  At the far left are five crosshairs which represent the
mean and one and two standard errors (not standard
deviations). The choice to plot both one and two
standard errors was to make it unambiguous as to
how many standard errors were represented.

•  At the far right, the crosshairs indicate the endpoints
of a non-parametric 95% confidence interval for the
median.

•  A  reference  line  of  dots  lies  on  the  diagonal  for  visual 
anchoring.  The  diagonal  line  can  be  a  great  aid  in 
comparing  different  quantile  plots. 
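
The paper does not say how the non-parametric confidence interval for the median is obtained. One standard construction, shown here purely as an assumption, takes a symmetric pair of order statistics whose coverage under the binomial distribution is at least the stated level; the function name is hypothetical.

```python
from math import comb

def median_confidence_interval(values, level=0.95):
    """Distribution-free interval (x_(k), x_(n-k+1)) for the median, or None."""
    ordered = sorted(values)
    n = len(ordered)
    best = None
    tail = 0.0                                # P(Binomial(n, 1/2) <= k - 1)
    for k in range(1, n // 2 + 1):
        tail += comb(n, k - 1) / 2 ** n
        if 1 - 2 * tail >= level:
            best = k                          # largest k that still keeps the coverage
        else:
            break
    if best is None:
        return None                           # sample too small for this level
    return ordered[best - 1], ordered[n - best]

print(median_confidence_interval(range(1, 11)))
```

For ten observations the interval runs from the second smallest to the second largest value; with fewer than six observations no such pair reaches 95% coverage.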

The primary purpose of these plots is to emphasize the
overall shape of the distribution and adding too many extra
symbols will distract the eye from this purpose. There are
other common statistics which are left out:

•  The standard deviation: First, for skewed data, the
standard deviation marks could extend over the
boundary of the plot on the high or low side,
possibly onto other sections of the graph. Second,
all information on the variability is contained in the
quantile plot itself and the information from the
standard deviation will be redundant. If it is
important to a specific presentation, the standard
deviation is easy to add.

•  Quantile points at a variety of locations (.05, .15,
.25, ...). When included in a narrow range between .4
and .6, these points tended to clutter the plot.

•  Altering shapes, coloring, and shading of points was
rejected in favor of having all points have equal visual
impact and thus, equal importance.

•  Including  a  smoothed  version  of  the  quantile  function 
is  certainly  possible,  but  then  one  must  make  a 
choice  of  smoothing  method.  A  recent  example  of 
one  such  method  can  be  found  in  [Yang]. 

Here the reader may ask, "Why not use cumulative
distribution plots?" The vertical orientation of the quantile
plot brings gravity inherent in the page into play: areas of
lower slope are more "stable" spots. In a cdf, a variable
which tends to have "higher" values has a "lower" cdf, while
in a quantile plot, "higher" really means "higher" in both
senses of the word. If we truly think of the random variable
as a function, standard convention puts the function value or
range on the vertical axis and domain on the horizontal.
One more point about gravity: in a histogram with
horizontal bars or a stem-and-leaf plot the larger bars might
look as if they will break and fall off.

Splitting  Plots  and  Ranked  Values 

By splitting the graph into separate, parallel graphs by
different groups, one can perform a visual analysis of
variance to supplement an ANOVA table and to provide
more impact to a presentation. There are four ways to split
the graph: along the horizontal axis, along the vertical axis,
across pages, and overlaid. It is best to split horizontally
by the variable of most interest, for the mean/standard
error and median/confidence marks will line up for easy
comparison. Overlaying one quantile graph over another for
comparison purposes has great appeal, but it will also cause
crowding if the two quantile functions are very close. How
to differentiate between the symbols from two or more
separate sets of observations in an obvious way is an
additional complication. Splitting vertically makes sense
for confounding variables where tests for differences are less
important.

Figure  2,  "Score  at  Week  3",  contains  the  same  data  as 
in  Figure  1,  but  this  time  split  by  the  two  categorical 
variables  Treatment  and  Sex.  Note  that  there  are  about  half 
as  many  males  as  females,  as  indicated  by  fewer  points  in 
the  cells  for  males.  Also,  the  means  and  standard  errors,  on 
the  left  side  of  each  cell,  indicate  that  treatment  1  is  more 
effective.  A  higher  score  means  worse  in  this  variable,  so 
that  the  further  below  the  diagonal  line  the  quantile  plot  lies, 
the  better  off  the  patients  are.  It  is  important  to  remember 
that  the  means,  standard  errors,  and  all  other  summary 
statistic  symbols  in  the  plot  are  not  based  on  any  particular 
model  but  only  the  data  in  each  cell  alone. 

One  could  also  plot  rank-transformed  data  to  graphically 
look  at  a  non-parametric  Kruskal-Wallis  ANOVA.  In  this 
case,  the  diagonal  line  enhances  the  plot  because  it 
represents  a  theoretical  distribution  of  rank  transformed 
values.  Here,  ties  are  assigned  the  mean  rank  but  the  range 
of  the  plot  runs  from  0  to  N,  or  0  to  1  if  the  ranks  are 
divided  by  N.  When  split,  deviations  from  the  overall 
distribution  show  as  more  points  above  or  below  the 
diagonal  line.  Figure  3,  "Score  at  Week  3  (Ranked 
Values)",  is  again  based  on  the  same  data  using  the  ranks  of 
the  values  within  the  whole  sample  rather  than  the  values 
themselves.  Note  that  the  values  at  the  top  have  been 
squeezed together and are no longer evenly spaced.

The  plotting  of  rank  transformed  data  is  also  useful  for 
ordered  categorical  data,  which  includes  dichotomous  data. 
However,  one  should  remember  to  use  mean  rank  so  that  the 
points  will  not  end  up  all  at  the  top  or  the  bottom  of  the 
cell,  possibly  merging  with  points  from  another  cell.  For 
ranked values the diagonal line should go through the centers
of each level overall. In Figure 4, "Outcome (Ranked
Values)", there is a single dichotomous outcome variable.
Treating it as a numerical variable and creating quantile plots
of the ranked values, we have one way of graphing
categorical  data  in  a  2  by  2  by  2  table.  Note  that  the  upper, 
right-hand  cell  has  fewer  observations  and  the  rows  of 
diamonds  reflect  the  relative  proportion  of  observations  with 
each  of  the  outcomes.  The  upper,  right-hand  cell  has  a 
lower  rate  of  high  outcome.  But  to  reiterate,  the  means  and 
standard  errors  are  based  only  on  each  cell,  not  on  any 
overall  model. 

Comparison  to  the  Box  Plot 

Often  one  starts  looking  at  data  with  a  traditional  box 
plot  [Tukey],  but  some  have  been  looking  for 
improvements.  One  possibility  is  altering  the  shape  of  the 
box  to  show  density  [Benjamini]  which  requires  some 
density  estimation.  One  of  the  problems  is  the  visual 
difference  between  the  "box"  and  the  "whiskers".  What  is 
the  intuitive  meaning  of  representing  the  middle  half  of  the 
data  with  a  two  dimensional  object  and  other  subsets  with 
one  dimensional  objects?  Also,  computer  statistics 
packages  can  be  inconsistent  in  their  calculation  of  the 
length  of  the  whiskers  [Frigge,  et  al]. 

The  box  plot  also  does  not  readily  reflect  the  actual 
number  of  observations,  N.  Some  try  to  remedy  this  by 
letting  the  width  of  the  box  be  proportional  to  the  square 
root  of  N.  However,  with  this  alteration,  different  box  plots 
are  no  longer  comparable  visually.  In  the  quantile  plot  there 
is no need to make a mental transformation from width
(√N), or whatever, to N. Sometimes there are confidence
regions around the median with either "notches" in the box,
or shaded blocks; but the notches or shaded blocks alter the
visual weight of the primary features of the box plot.

The box plot is an abstract picture based on a handful of
statistics calculated from the data. There is a reduction of
information in the transformation, which is fine if these
statistics are the right statistics. However, if the
distribution has certain peculiarities, that handful of statistics
may not reflect important features and may instead present an
inaccurate picture.

Of  course,  the  combination  of  a  box  plot  with  a  stem- 
and-leaf  plot  or  histogram  will  give  more  information,  but 
there  are  some  drawbacks: 

•  The  combination  requires  two  graphs  and  uses  more 
space  and  paper. 

•  The  histogram  implicitly  requires  a  choice  of  division 
points  which  is  a  smoothing  decision.  Likewise,  the 
stem-and-leaf  plot  also  has  implicit  smoothing  and 
often  must  round  values  to  a  convenient  number  of 
significant digits.

•  By  continually  varying,  each  quantile  plot  will  be 
nearly  unique.  The  endless  variety  of  plots  may  hold  an 
audience's  attention  longer  because  human  beings  tend 
to  notice  and  be  more  curious  about  variety. 

Figure 5A, "Box Plots", contains two groups of data.

The first thing to notice is that group B has significantly
higher values and is more spread out. Group A looks to
have some outliers on both the high and low ends.

Figure 5B, "Quantile Plots of 5A", contains the same
data but shows a different picture. Group B actually has
what looks to be a bimodal distribution with the median
falling nearly halfway between the two modes. This
property did not show up in the box plot. Additionally, in
group A, three suspected outliers on the high side actually
turn out to be about 10% of group A. The box plot was
using single asterisks to indicate what was actually more
than one observation. As it turns out, after discussing this
with the client, there was a systematic problem in our
definition of this variable which led to the suspicious
distribution. The additional detail in the quantile plot helped
identify and explain the problem much sooner.

Summary 

The quantile plot is a less abstract presentation of an
empirical distribution than the traditional box plot. It
presents a picture closer to the statistician's own mental
picture of the data and analyses. Because it displays each
observation and not just an object created from certain
statistics, it may be better for presentation of data to
non-statisticians. Finally, quantile plots can show features of
the data that might be hidden by other methods, including
problems resulting from bad data coding or calculation
errors.

Abstract

Extending the work of Wachter (1978, 1980) and
many others, we study the configuration of the singular values
(s.v.'s) of an a by b matrix of the form $X = M + \sigma Z$,
where M is a constant matrix and the elements of Z are
i.i.d. standard Gaussian, in the limit as a and b increase in
constant ratio. We put $N = a + b$ and suppose $a = \alpha N$,
with $\sigma$ of order $1/\sqrt{N}$. Let the empirical distribution
of the s.v.'s of X be $G_N$, and let the corresponding
moment-generating function (m.g.f.) be $g_N(t)$. These are
random quantities; their distributions depend only on $\sigma$ and
the empirical distribution $F_N$ of the s.v.'s of M. We derive a
differential equation that governs the evolution of $E(g_N)$ as
$\sigma$ increases. In the limit as $N \to \infty$ we can solve this
equation and hence exhibit the limiting (non-random) g itself.

This  study  was  motivated  by  some  blood-pressure 
data  collected  by  a  new  type  of  transducer.  It  suggests  a 
novel  way  of  adjusting  large  matrices  to  reduce  the  effect  of 
additive  contamination.  ... 


1.  Introduction 

In the standard technique for measuring blood pressure,
a pressure cuff is applied to the upper arm, inflated to
constrict the artery, and deflated while a technician listens
(through a stethoscope) for the so-called Korotkoff signal.
A novel form of transducer now allows the recording of a
continuous trace of inaudible low-frequency auditory data,
thus affording a first glimpse of the details of the process.
Figure 1 shows such a record, segmented into individual
heartbeats. Cuff pressure decreases down the figure.

Figure 1

An early attempt to analyze such data consisted of
regarding Figure 1 as a display of the rows of a 70x373
matrix X. We performed a singular-value decomposition in
the hope that an additive representation of the form

$$x_{ij} = \sum_{k=1}^{m} a_{ik}\, c_k\, b_{kj} \qquad (1)$$


with m not too large, would fit the data; here each row of the
matrix $B = (b_{kj})$ is a prototypical component of a heartbeat
trace, and the columns of $A = (a_{ik})$ show how these components
enter and leave during the evolution of the traces. By
convention, the rows of B and the columns of A are standardized
to unit length; the magnitudes of the coefficients
$\{c_k\}$ measure the importance of the components. It is a
property of the singular-value decomposition of a matrix
that the best (least-squares) representation of the form (1),
using m terms, is obtained by taking the first m components
of the singular-value decomposition

$$X = A C B^{T}$$

where $A^{T}A = B^{T}B = I_r$, C is diagonal, and r is the rank of
X. It would be pleasant to find that a small value for m suffices
to give a good fit to the data.
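
The least-squares property quoted above is the Eckart-Young theorem. As a sketch in the notation of the representation (1), and assuming (this ordering is not stated in the text) that the singular values are arranged so that $c_1 \ge c_2 \ge \cdots \ge c_r \ge 0$, the residual sum of squares left by the first m components is the sum of the squares of the discarded singular values:

```latex
\min_{\operatorname{rank}(\hat X) \le m} \sum_{i,j} \bigl( x_{ij} - \hat x_{ij} \bigr)^2
  \;=\; \sum_{i,j} \Bigl( x_{ij} - \sum_{k=1}^{m} a_{ik}\, c_k\, b_{kj} \Bigr)^2
  \;=\; \sum_{k=m+1}^{r} c_k^2 .
```

The fraction of the total sum of squares captured by m components is therefore $\sum_{k \le m} c_k^2 / \sum_{k \le r} c_k^2$, which is one common way of judging whether a small m gives a good fit.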

On  performing  the  calculation,  we  found  that  a  few  of 
the  singular  values  were  quite  large,  while  most  were  small. 
We  were  faced  with  the  problem  of  deciding  how  many 
components  to  use. 


Singular  Values  55 


2.  An  idealized  problem. 

As an idealization of this set-up, suppose we have
observed an $a \times b$ matrix X with the structure

$$X = M + \sigma Z$$

where the elements of Z are independent standard Gaussian.
How do the singular values of X depend on those of M? We
formulate this as an asymptotic question. Suppose $a \le b$,
and put $N = a + b$, $a = \alpha N$, $b = \beta N$. We study the asymptotic
configuration of the singular values (s.v.'s) of an $a \times b$
matrix of the form $X = M + (\sigma/\sqrt{N})\, Z$, where M is a constant
matrix and the elements of Z are i.i.d. standard Gaussian.
Suppose X has singular values $x_1, \ldots, x_a$. It is convenient
to work with a symmetrized form of the empirical distribution
of these s.v.'s, namely

$$G_N(x) = \frac{1}{N}\Bigl( \sum_{i=1}^{a} I(-x_i \le x) + (b-a)\, I(x \ge 0) + \sum_{i=1}^{a} I(x_i \le x) \Bigr)$$
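
The symmetrized empirical distribution can be illustrated with a short sketch. It assumes one reading of the garbled display: the a singular values are reflected about zero, (b - a) zeros are added, and the total is divided by N = a + b, with "less than or equal" in every indicator. The function name `symmetrized_ecdf` and the example values are hypothetical.

```python
def symmetrized_ecdf(singular_values, b):
    """Return G(x) for an a-by-b matrix (a <= b) with the given a singular values.

    The symmetrized sample is {-x_i}, {+x_i} and (b - a) zeros: N = a + b points.
    """
    a = len(singular_values)
    if a > b:
        raise ValueError("expected a <= b")
    n_total = a + b

    def G(x):
        below_neg = sum(1 for s in singular_values if -s <= x)  # reflected values
        below_pos = sum(1 for s in singular_values if s <= x)   # original values
        zeros = (b - a) if x >= 0 else 0                        # padding zeros
        return (below_neg + zeros + below_pos) / n_total

    return G


G = symmetrized_ecdf([0.5, 2.0], b=3)   # a = 2, b = 3, N = 5
print(G(-1.0), G(0.0), G(1.0), G(2.0))  # 0.2 0.6 0.8 1.0
```

Under this reading the N points are the eigenvalues of the symmetric (a + b) by (a + b) matrix with X and its transpose in the off-diagonal blocks, which is why the distribution is symmetric about zero and has total mass one.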

We also need the generating function (modified Stieltjes
transform)


approximate methods.

3.  A  differential  equation 

Since we are assuming that Z is Gaussian, we can
appeal to the fact that

$$M + \sigma Z = M + s Z_1 + t Z_2$$

where $\sigma^2 = s^2 + t^2$, and $Z_1$ and $Z_2$ are independent Gaussian.
We can set up a differential equation for the expected
generating  function 

by  retaining  only  the  terms  of  order  in  the  expected 
moments.  We  find 


and  we  need  merely  to  solve  this  equation. 


J  j^dG^(x) 

t=0 

where $X_{2k}$ is the 2k-th moment of the (symmetrized) distribution
so

Both $G_N$ and $g_N$ are random quantities; their distributions
depend only on $\sigma$ and the (symmetrized) empirical distribution
$F_N$ of the s.v.'s of M. We define the moments $M_{2k}$
and a generating function $f_N$ from $F_N$ in a similar fashion.
Below we shall let $N \to \infty$, and shall assume that $F_N$
converges to a limiting distribution F that has a moment-generating
function f (with moments $\mu_{2k}$) that converges
within some non-vanishing interval.

Wachter (1978) considered this problem, replacing
the Gaussian assumption by one involving boundedness of
moments; also he allowed the columns of Z to have different
variances. However (in our notation) he assumed $M \to 0$ as
$N \to \infty$, so that the effect of M was negligible in the limit. In
the present work, the role of M is crucial. Our results seem
to be new. We find, as did Wachter, that as $N \to \infty$, $G_N$
converges to a non-random limit G (with generating function g
and moments $\gamma_{2k}$).

We derive a differential equation for $E(g_N)$. We
cannot solve this in general; however, letting $N \to \infty$, we
derive a formula for the limiting g as a function of f and $\sigma$.
In principle this enables us to calculate the density corresponding
to f once g is known; in practice (since N is finite)
this is an ill-conditioned calculation and we need


4. The case $N = \infty$, $\alpha = 1/2$.

From  (2)  we  have 

da^  2^dx 
with  the  boundary  condition 


Y(0,x)=-/*^Hx) 


(3) 


This  relation  provides  a  rapid  way  of  computing  the 
moments  of  G  from  those  of  F.  The  solution  of  (3)  is 

Y(a^  ,z) = Y^*^  ■^°^  ( z) = Y^*^Hy) 


where 


z=y-(-a^y“^(y) 


When  Af = 0  we  find 

/°Hz)=y(o,z)=^ 

Y(®z)(z)=Y(a^^)=^ 

2y 

where 

z=y+a^Y‘^Hy)=y  +  ^ 

so 

Y(a^  ,z)  =  ^  (z  -  Vz^  -  2a^ ) 
f{oZ)(^x)=-^^2a^-x‘^  0<x<'J2a^ . 




56  L.  Denby  and  C.  Mallows 


This is Wigner's "semicircle law"; see Mehta (1967) and
Figure  2. 


Figure  2 

5. The case $N = \infty$, $\alpha \ne 1/2$.

Write  5= (a- p)^.  We  find 

zy(a^,z)=yy(0,y)+a^Q 


(4) 


where 


find 


Q=f(a^,z)--^=f(0.y)--^ 

4z  Ay 

When M = 0, take $\sigma = 1$, and write $\gamma$ for $\gamma(1, z)$. We
1  1  8 


whence 

f{x)=  — V(B2-x2)(jc2_a2) 
nax 

where 

$$A = \sqrt{\beta} - \sqrt{\alpha}, \qquad B = \sqrt{\beta} + \sqrt{\alpha}.$$

See Figures 3, 4 for the cases $\alpha = .3$, $\alpha = .1$.


(5) 


Figure  3 


Figure  4 

6.  A  special  case. 

Suppose $\alpha = 1/2$, and that all the s.v.'s of M are equal
to $\mu$. Then $\mu_{2k} = \mu^{2k}$, and

$$f(x) = \frac{1}{2}\left( \frac{1}{1 - x\mu} + \frac{1}{1 + x\mu} \right)$$

Thus if $X = M + (\sigma/\sqrt{N})\, Z$ (remember N = 2a) we have from

(7) 


y*>(x)= 


where 


1  =  1  +  £1__Z_ 
x")»  2  1-yV^ 

This relation holds within the circle of convergence. To get
F itself, we need to continue the definition outside this circle,
taking care to use the correct branch. Then we apply the
formula (see Wachter (1978))

g(^)=^Im(y(-i)) 

In  one  case  we  can  get  an  explicit  result,  namely  when 
p2  =  a^/2 

In  this  case 

x=y(l-yV^) 

so  that 

/(l/^)=y2^ 

Thus  we  need  only  solve  a  cubic  equation.  Writing 
y=(2/pV3j sin0,  we  have  sin30=3VT4/24.  We  get  com¬ 
plex  roots  for  l^l<3VTp/2.  For  0<^<3VTp/2  we  put 
0=n/6-i-j\)/ and  find 


where 


/^(0=--z^sinh2\ff 


cosh3v= 


(6) 

See Figure 3. Remember that this is the symmetrized distribution,
so we need to multiply by 2 to get the limiting density
of the s.v.'s of X.

Figure 5 (axis label: s.v.'s, alpha = .5, M = sqrt(N/2) I)


7.  Statistical  application. 

Suppose we compute the s.v.'s of a large matrix, and
observe that their empirical distribution is similar to (6)
above. Then this supports the view that the matrix can be
regarded as the sum of (i) a fixed matrix with all s.v.'s
equal, and (ii) a matrix of independent random variables
with equal variances. In more generality, if the s.v.'s of a
large square matrix have an empirical distribution G, we
would like to estimate an F such that the relation (4) is
approximately satisfied. As yet we have no detailed suggestions
as to how to do this.

For the 70x373 matrix that stimulated this investigation,
we find that a q-q plot of the 70 realized singular values
against quantiles of the distribution (5) (Figure 6) is
very far from linear; the lowest 30 or so s.v.'s (Figure 7
shows 40) do conform roughly to this null prescription, with
$\sigma$ about 63. But this value for $\sigma$ is much too large to be
reasonable for these data; computing the root-mean-square
successive difference of the rows of the matrix, we get numbers
averaging 23, with a maximum of 43. We conclude that
for this approach to work, we will need an M with very few
non-zero singular values. Evidently this approach is
unsuited to these data.


Figure 7 (horizontal axis: quantiles, alpha = 70/443)


8.  Final  Comments 

Clearly  this  work  is  incomplete.  Among  the  things 
that  need  doing  are: 

(i)  Extend  the  results  to  dispense  with  the  assumptions  of 
Gaussianity  and  identical  distribution  of  the  elements  of  Z. 

(ii)  Extend  the  results  to  dispense  with  the  assumption  that 
the  moments  of  F  are  finite. 

(iii)  Develop  an  algorithm  to  find  the  density  of  G  directly 
from  the  density  of  F. 

(iv) Develop techniques to do (iii) approximately (in some
appropriate sense) in the case N finite.

(v)  Solve  (2)  in  general. 

(vi) Study the variability of $G_N$.

Abstract

This paper presents a method for interactively exploring a
large set of quantitative multivariate data, in order to estimate
the shape of the underlying density function. It is assumed that
the density function is more or less smooth. The local structure
of the data in a given region may be examined by viewing the
data through a Gaussian window, whose location and shape
are chosen by the user. The method, which is applicable in any
number of dimensions, can be used to find and describe simple
structural features such as peaks, valleys, and saddle points in
the density function, and also extended structures such as
ridges and analogous structures in higher dimensions. A
Gaussian window is defined by giving each data point a weight
based on a multivariate Gaussian function. The weighted
sample mean and sample covariance matrix are then computed,
using the weights attached to the data points. These
quantities are used to compute an estimate of the shape of the
density function in the window region. The local structure of
the data is described by a method similar to the method of
principal components. Thus we can apply our geometrical
intuition to the structural features we find in the data, in any
number of dimensions. By taking many such local views of the
data, we can form an idea of the structure of the data set. Since
the computations involved are relatively simple, the method
can be implemented on a small computer.

1  Introduction 

Suppose that we are given a large set of quantitative
multivariate data, say, N data points $x_i$ in a p-dimensional
space, and that we want to explore the structure of the data.
That is, we want to find the shape of the underlying density
function, by looking for concentrations of data points. We will
assume that the density function is more or less smooth, but we
will not make any more specific assumptions about its structure.
To explore the data, we need a way to look at the local
structure of the data in a limited region. So we will examine the
data in a given region by viewing the data through a Gaussian
window, whose location and shape are chosen by the user. We
will describe the local structure of the data by a method similar
to the method of principal components. By doing this we will
be able to find and describe simple structural features in the
data in any number of dimensions.

*Work reported herein was supported in part by Cooperative Agreements
NCC 2-408 and NCC 2-387 between the National Aeronautics
and Space Administration (NASA) and the Universities Space
Research Association (USRA).

Some examples of the kinds of structures that we can find
and describe are the following: a peak, or relative maximum,
in the density function, which would appear as a cluster of data
points; a valley, or relative minimum; and a saddle point,
where the density function would be concave upward in some
directions, and downward in others. We can also find extended
structures such as a “ridge”, or “bar”, in the data. A “ridge” is
an essentially one-dimensional structure, or concentration of
data points, consisting of data points lying near a “center line”
but scattered about it in all directions. Only a part of such an
extended structure would be visible in a single window. In a
case like this we will be able to tell that we are looking at a
structure that extends beyond the window. We can then follow
along it and map out its extent and shape. Similarly, we might
find an essentially k-dimensional structure in a p-dimensional
space, for any k < p.

By  taking  many  local  views  of  the  data,  that  is,  by 
exploring  the  data  interactively,  we  can  build  up  an  idea  of  the 
structure  of  the  data  set.  With  some  practice,  we  can  apply  our 
geometrical  intuition  to  the  features  we  find  in  the  data,  in  any 
number  of  dimensions.  Since  the  computations  are  relatively 
simple,  the  method  can  be  implemented  on  a  small  computer. 

The  approach  here  is  different  from  that  in  the  many 
graphical  methods  that  involve  projecting  the  data  onto  a  space 
of  lower  dimension.  See  for  example  Chambers  et  al.  (1983) 
and  Cleveland  and  McGill  (1988).  However,  such  graphical 
methods  can  be  used  in  conjunction  with  the  method  described 
here. 

The ideas outlined in this paper are treated more thoroughly
in Jaeckel (1990).

2 The Gaussian window

To focus on a limited region in the space, we use a window.
A Gaussian window is defined by choosing a center point a
and a non-negative definite symmetric matrix V to describe
its size and shape. Let

$$w(x) = e^{-\frac{1}{2}(x-a)'V(x-a)}$$

where x is a p-vector and “prime” means “transpose”. The
matrix V is analogous to the inverse of a covariance matrix.
Each data point $x_i$ is given the weight $w_i = w(x_i)$. Note that
$w(a) = 1$, that $w(x) \le 1$ for all x, and that w(x) decreases as
x moves away from a. Thus we have defined a window with
“fuzzy” boundaries. The function w(x) may be thought of as
the relative transparency of the window at x.

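We then compute the weighted sample mean vector,

$$\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i},$$

and the weighted sample covariance matrix,

$$S_w = \frac{1}{\sum_i w_i} \sum_i w_i (x_i - \bar{x}_w)(x_i - \bar{x}_w)' .$$

We also compute $(1/N)\sum_i w_i$.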

These  quantities  are  the  simplest  things  to  compute,  espe¬ 
cially  in  a  high-dimensional  space.  They  describe  the  overall 
shape  of  the  weighted  data  in  the  “window  region”  (the  region 
vaguely  defined  as  the  region  where  w(x)  is  “not  small”).  The 
estimated  shape  of  the  density  function  in  the  window  region 
will  be  based  on  these  quantities.  Note  that  these  quantities  are 
overall  statistics;  any  “fine  structure”  in  the  region  is  smeared 
out.  To  look  for  finer  details,  we  would  use  smaller  windows. 
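
The three window statistics can be sketched in a few lines of plain Python. This is an illustration only: the function names, the sample points and the choice of V are hypothetical, and both the mean and the covariance are normalised by the sum of the weights, which is an assumption about the garbled formulas rather than something the text confirms.

```python
import math


def gaussian_weights(points, center, V):
    """w(x) = exp(-0.5 * (x - a)' V (x - a)) for each data point x."""
    weights = []
    for x in points:
        d = [xi - ai for xi, ai in zip(x, center)]
        quad = sum(d[r] * V[r][c] * d[c] for r in range(len(d)) for c in range(len(d)))
        weights.append(math.exp(-0.5 * quad))
    return weights


def weighted_mean_and_cov(points, weights):
    """Weighted sample mean vector and covariance matrix, normalised by sum(w)."""
    total = sum(weights)
    p = len(points[0])
    mean = [sum(w * x[r] for w, x in zip(weights, points)) / total for r in range(p)]
    cov = [[sum(w * (x[r] - mean[r]) * (x[c] - mean[c])
                for w, x in zip(weights, points)) / total
            for c in range(p)] for r in range(p)]
    return mean, cov


# Made-up two-dimensional data; the last point lies far outside the window.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
V = [[1.0, 0.0], [0.0, 1.0]]           # spherical window of unit scale
w = gaussian_weights(points, center=(0.0, 0.0), V=V)
x_bar_w, S_w = weighted_mean_and_cov(points, w)
avg_weight = sum(w) / len(points)      # (1/N) * sum of the weights
print(x_bar_w, S_w, avg_weight)
```

The distant point receives a weight of about exp(-25), so it has almost no influence on the windowed mean and covariance; shrinking the window corresponds to scaling V up.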


3  Example:  a  cluster 

Suppose that in the region of a window, the density
function has approximately a multivariate Gaussian shape:

$$f(x) = C \, \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)}$$

where $\mu$, $\Sigma$, and C are all unknown parameters. That is, we
have a single peak (or cluster of data points) in the window
region. The vector $\mu$ is the center point of this part of the
density. The symmetric matrix $\Sigma$ is its covariance matrix. The
constant C represents the “probability mass” of this part of the
entire probability distribution.

The windowed density function, the effective density
function of the data as viewed through the window, is w(x)f(x).
That is, if we assign weight $w_i = w(x_i)$ to each data point $x_i$,
and if we do computations with the weighted $x_i$, the results will
be as if we were working with a sample from w(x)f(x).


Assume for simplicity that a, the window center, is 0.
Let $B = \Sigma^{-1}$. It will be more convenient to work with B.
Let $A = B + V$. Then, by doing some algebra, we find that
the windowed density function is

$$w(x)f(x) = K \, \frac{|A|^{1/2}}{(2\pi)^{p/2}} \, e^{-\frac{1}{2}(x - A^{-1}B\mu)'A(x - A^{-1}B\mu)}$$

This is a multivariate Gaussian function with “windowed
mean” $A^{-1}B\mu$ and “windowed covariance matrix” $A^{-1}$. It
follows that the weighted sample mean $\bar{x}_w$ is an estimate of
$A^{-1}B\mu$, and the weighted sample covariance matrix $S_w$ is an
estimate of $A^{-1}$. The constant K above is the integral of
w(x)f(x) over the entire space. We will estimate it by
$(1/N)\sum_i w_i$, the average of the weights.

We now “degauss” the view of the data as seen through the
Gaussian window; that is, we remove the effect of the weights
on the shape of the data in the window region. Since $S_w$ is an
estimate of $A^{-1}$, we can estimate A by $S_w^{-1}$, and we have

$$A = B + V.$$

So we can estimate B by

$$\hat{B} = S_w^{-1} - V.$$

We can then estimate $\Sigma$ by

$$\hat{\Sigma} = \hat{B}^{-1} = (S_w^{-1} - V)^{-1},$$

assuming that $S_w^{-1} - V$ is positive definite.

Since $\bar{x}_w$ is an estimate of $A^{-1}B\mu$, we can estimate $\mu$ by

$$\hat{\mu} = \hat{B}^{-1} S_w^{-1} \bar{x}_w .$$

And since $(1/N)\sum_i w_i$ is an estimate of K, we can also estimate
the constant C. These estimated parameters give us an
estimate of the shape of the density function in the window
region. Note that all of the computations are simple matrix
operations.
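
The degaussing step can be checked in the one-dimensional special case, where every matrix is a scalar. The sketch below is an illustration under that simplifying assumption, with the window centered at 0; the function names are hypothetical. It first computes the exact windowed mean and variance implied by the formulas above and then confirms that the degaussing formulas recover the original parameters.

```python
def windowed_moments_1d(mu, var, v):
    """Exact mean and variance of w(x)f(x) in one dimension, window centered at 0."""
    B = 1.0 / var          # inverse variance of the cluster
    A = B + v              # A = B + V
    return B * mu / A, 1.0 / A


def degauss_1d(mean_w, var_w, v):
    """Recover (mu, var) of the cluster from the windowed mean and variance."""
    B_hat = 1.0 / var_w - v
    if B_hat <= 0:
        # Corresponds to S_w^{-1} - V failing to be positive definite.
        raise ValueError("S_w^{-1} - V is not positive; no cluster-shaped estimate")
    var_hat = 1.0 / B_hat
    mu_hat = var_hat * mean_w / var_w      # B_hat^{-1} S_w^{-1} x_bar_w
    return mu_hat, var_hat


mean_w, var_w = windowed_moments_1d(mu=2.0, var=4.0, v=0.5)
mu_hat, var_hat = degauss_1d(mean_w, var_w, v=0.5)
print(mean_w, var_w, mu_hat, var_hat)
```

Here the window pulls the apparent center from 2 toward the window center and shrinks the apparent variance from 4 to about 1.33; degaussing undoes both effects. With real data, `mean_w` and `var_w` would be the weighted sample statistics, so the recovery is only approximate.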

If we find a cluster in a window, we can describe its shape
using the method of principal components. See Morrison
(1990). To do this we find the eigenvalues and corresponding
eigenvectors of $\hat{\Sigma}$. The estimated shape of the cluster is a
p-dimensional ellipsoidal shape centered at $\hat{\mu}$. The principal
axes of the ellipsoid are parallel to the eigenvectors. The
estimated density function can be expressed as a product of p
univariate Gaussian (normal) densities, each lying along a
principal axis. The standard deviation of each of these densities
is the square root of the corresponding eigenvalue (all of
which are positive in this case). Thus we have a way of
thinking about the shape of the cluster in any number of
dimensions.

Note that we could base this analysis on the matrix $\hat{B}$,
which is the inverse of $\hat{\Sigma}$. These two matrices have the same
eigenvectors, and the eigenvalues of $\hat{B}$ are the reciprocals of
those of $\hat{\Sigma}$. It follows that a large positive eigenvalue of $\hat{B}$
indicates that the data points are tightly concentrated along the
corresponding direction, while an eigenvalue near 0 indicates
a structure that may extend beyond the window region. When
we deal with more general structures, we will analyze their
shape by looking at the eigenvalues and eigenvectors of $\hat{B}$.

The analysis above also applies if the shape of the density
function in the window region is a valley or a saddle point. In
these cases all or some of the eigenvalues of B will be negative.
A  negative  eigenvalue  indicates  that,  in  the  window  region,  the 
density  function  is  concave  upward  along  the  direction  of  the 
corresponding  eigenvector. 


4  The  general  case 

We now give a more general formulation which will
include the examples above, and also extended structures such
as a “ridge”. We will assume that the density function in the
window region can be approximated by

$$f(x) = H \, e^{-\frac{1}{2}x'Bx + r'x}$$

The exponent is a general polynomial of degree two in the
coordinates of the vector x. (Any constant term is absorbed in
H.) The constant H is the density at the window center
(assumed to be at 0). The symmetric matrix B may or may not
be positive definite, and it may or may not be non-singular. If
B is singular, there is no center point $\mu$ for the function.

As before, the windowed density function w(x)f(x) is a
multivariate Gaussian function. We therefore compute $\bar{x}_w$, $S_w$,
and $(1/N)\sum_i w_i$ as before, and we estimate the parameters B, r,
and H based on these quantities. See Jaeckel (1990). Since in
the general case B might be nearly singular, we will work
directly with B instead of inverting it. We then find the
eigenvalues and eigenvectors of B, and we use these quantities
to describe the shape of the estimated density function in the
window region. The method is analogous to the method of
principal components. The interpretation of the eigenvalues
of B is the same as in the previous section. As in principal
components analysis, we can express the estimated density
function as a product of p functions of one variable each.

We can now handle the case of an extended structural
feature, such as a “ridge” of data points, that passes through
a window and extends beyond it. In this case B will have some
eigenvalues very near 0; these eigenvalues tell us that the
structure extends beyond the window. Since B is the estimated
inverse covariance matrix, an eigenvalue near 0 indicates that
the data in the window region appear to have an essentially
“infinite” variance in the direction of the corresponding
eigenvector. In the case of a ridge, which is an essentially
one-dimensional concentration of points, B will have one
eigenvalue very near 0, and the corresponding eigenvector will be
parallel to the “center line”, or crest, of the ridge.
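
The interpretation of the eigenvalues of B described above can be sketched for the two-dimensional case, where the eigenvalues of a symmetric matrix have a closed form. The function names and the cutoff `tol` for "very near 0" are hypothetical; the paper does not give a numerical threshold, and in practice the cutoff would depend on the scale of the data and of the window.

```python
import math


def symmetric_2x2_eigenvalues(B):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix [[p, q], [q, r]]."""
    p, q, r = B[0][0], B[0][1], B[1][1]
    half_trace = (p + r) / 2.0
    radius = math.hypot((p - r) / 2.0, q)
    return half_trace - radius, half_trace + radius


def describe_structure(B, tol=1e-3):
    """Label the local shape from the signs of the eigenvalues of B.

    tol is an arbitrary illustrative cutoff for "very near 0".
    """
    eigs = symmetric_2x2_eigenvalues(B)
    if any(abs(e) <= tol for e in eigs):
        return "extended structure (e.g. ridge)"
    if all(e > 0 for e in eigs):
        return "peak"
    if all(e < 0 for e in eigs):
        return "valley"
    return "saddle point"


print(describe_structure([[2.0, 0.0], [0.0, 0.5]]))    # peak
print(describe_structure([[1.0, 0.0], [0.0, -1.0]]))   # saddle point
print(describe_structure([[1.0, 1.0], [1.0, 1.0]]))    # extended structure (e.g. ridge)
```

In the last example the eigenvalues are 0 and 2, so the density is concentrated across one direction and essentially flat along the other, which is the signature of a ridge passing through the window.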

Since a structure like this does not have a center point, as
a cluster does, we will not try to estimate a center point here.
Instead, we will estimate the location of the center line of the
ridge. See Jaeckel (1990). We can also use the p-1 remaining
eigenvalues and eigenvectors to estimate the shape of the
cross-section of the ridge. In a p-dimensional space, a ridge
would have a (p-1)-dimensional cross-section orthogonal to
the center line.

If  we  find  a  structure  like  this,  we  can  then  move  the 
window  center  to  the  nearest  point  on  the  center  line  and  try 
another  window.  Then  we  can  follow  along  the  ridge  by 
moving  the  window  center  along  the  estimated  center  line.  By 
continuing  in  this  way  we  can  map  out  the  extent  and  shape  of 
the ridge. An essentially k-dimensional structure, or concentration
of data points, can be treated in a similar way.

Since the method is interactive, it is flexible and open-ended.
It can be used (in principle) in any number of dimensions.
Few assumptions are made about the data. We can
search for structural features by trying many different windows,
and we can describe the features we find. Then we can
put together what we have found into an overall description of
the data. The method can be used in conjunction with other
methods, such as graphical methods and automatic clustering
algorithms. Note that with this method we can find structural
features other than clusters. Since the computations are relatively
simple, the method can easily be implemented on a small
computer. Any standard algorithms for inverting a matrix and
for finding the eigenvalues and eigenvectors of a symmetric
matrix can be used.

Most  importantly,  we  can  apply  our  geometrical  intuition 
to  the  features  we  find  in  the  data,  so  that  we  can  think  about 
and  describe  the  structure  of  a  set  of  data  in  any  number  of 
dimensions. 

Definition of the Problem

Factor analysis is a frequently used statistical
tool for representing a usually large number
of observable variables with a smaller set
of latent factors. In classical factor analysis,
the observable variables are expressed as linear
combinations of the factors. During this procedure
neither the factors nor the scores are
binary. Boolean factor analysis is a procedure
for the representation of binary variables in
terms of Boolean combinations of binary factors.

Suppose that X is a d-dimensional random
variable with binary coordinates and

$$X = A \otimes Y,$$

where A is a fixed (d x l) matrix with binary
coordinates and Y is an l-dimensional
random vector with binary coordinates, with
l < d. The ⊗ notation means that we are using
Boolean operations, which are reflected in
the following tables:

    ⊕ | 0  1        ⊗ | 0  1
    --+------       --+------
    0 | 0  1        0 | 0  0
    1 | 1  1        1 | 0  1


In our model, A is unknown, Y is unknown,
and l < d means that the data comes from
a smaller dimensional space through the fixed
matrix A.

Furthermore it is supposed that there is a
random error in the observations; instead of
X we observe $\tilde{X} = X + \epsilon$, where $\epsilon$ is a
d-dimensional random vector with independent
coordinates $\epsilon_1, \ldots, \epsilon_d$. The conditional
distribution of $\epsilon_i$ given $X_i$ is

    ε_i | P(ε_i | X_i = 0) | P(ε_i | X_i = 1)
    ----+------------------+-----------------
    -1  | 0                | p_1
     0  | 1 - p_0          | 1 - p_1
     1  | p_0              | 0


The error probability depends on the actual
value $X_i$ of the i-th coordinate of X. It is
supposed that the error probabilities $p_0$ and $p_1$
are small.

The  aim  of  the  Boolean  factor  analysis  is  to 
recover  A  and  Y  with  the  help  of  the  given 
data  set. 

The  idea  of  Boolean  factor  analysis  at  first 
appeared  at  the  BMDP  package  (see  Dixon 
[1])  although  their  model  is  slightly  different. 
The  algorithm  developed  in  [2]  is  entirely  dif¬ 
ferent  in  one  step. 

Finding the Boolean Scores and Loading
Matrices

Suppose the data set is given in matrix
form: D = (d_ij) is a (d × n) binary data matrix.
The algorithm seeks a loading matrix A and a
scores matrix S such that B = A ⊗ S is "close"
to D, where B is called the estimator or predictor
of D. We now define a criterion for closeness.

Definitions: A positive discrepancy means
that d_ij = 0 and b_ij = 1, i.e., the i-th variable
of the j-th case is 0 but is predicted
as 1. A negative discrepancy means that d_ij = 1
and b_ij = 0, i.e., the i-th variable of the
j-th case is 1 but is predicted as 0.
Suppose that the cost of a positive discrepancy is c_p > 0
and the cost of a negative discrepancy is c_n > 0.

The task is to find loading and score matrices,
for fixed unknown l, that minimize the
overall cost function:

\[
C = \| D - A \otimes S \|
  = c_p \sum_{i=1}^{d} \sum_{j=1}^{n} I(d_{ij} = 0,\ b_{ij} = 1)
  + c_n \sum_{i=1}^{d} \sum_{j=1}^{n} I(d_{ij} = 1,\ b_{ij} = 0).
\]
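
The cost function can be illustrated with a minimal Python sketch. The function names (`boolean_product`, `discrepancy_cost`) and the small matrices are hypothetical and chosen only for illustration; the sketch assumes matrices stored as lists of 0/1 rows and is not the DISTAN or BMDP implementation.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def boolean_product(a: Matrix, s: Matrix) -> Matrix:
    """Boolean matrix product: b[i][j] = OR over k of (a[i][k] AND s[k][j])."""
    d, l, n = len(a), len(s), len(s[0])
    return [[int(any(a[i][k] and s[k][j] for k in range(l))) for j in range(n)]
            for i in range(d)]


def discrepancy_cost(data: Matrix, a: Matrix, s: Matrix,
                     c_p: float = 1.0, c_n: float = 1.0) -> Tuple[float, int, int]:
    """Return (total cost, positive discrepancies, negative discrepancies)."""
    b = boolean_product(a, s)
    pairs = [(x, y) for row_d, row_b in zip(data, b) for x, y in zip(row_d, row_b)]
    pdis = sum(1 for x, y in pairs if x == 0 and y == 1)  # 0 predicted as 1
    ndis = sum(1 for x, y in pairs if x == 1 and y == 0)  # 1 predicted as 0
    return c_p * pdis + c_n * ndis, pdis, ndis


# Hypothetical example: d = 3 variables, l = 2 factors, n = 3 cases.
A = [[1, 0], [1, 1], [0, 1]]
S = [[1, 0, 1], [0, 1, 1]]
D = [[1, 0, 0], [1, 1, 1], [1, 1, 1]]
cost, pdis, ndis = discrepancy_cost(D, A, S)
```

With c_p = c_n = 1 the cost is simply the number of cells in which B = A ⊗ S differs from D.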

A two-step algorithm for solving the
above problem is described in detail in [2].
That version of the Boolean factor
analysis program was written for the Discrete
STatistical ANalysis (DISTAN) package sponsored
by the Social Science Information Center
of the Hungarian Academy of Sciences. The
program also exists in another version with more
features. A brief description of the algorithm
follows.

Step One searches for a new vector of the
loading matrix. The search is based on
the dependence between the variables. The
method developed for this step is different from
the one used in BMDP. This step is very important
because, at the beginning, only a small cost
is incurred if the loading matrix is
appropriately chosen. To explain in
more detail: in the first step we must give a
d-dimensional 0-1 vector as the loading matrix
and a one-dimensional score for each case.
Suppose that both costs are c_p = c_n = 1. Considering
the nature of the Boolean operations,
we can initialize the algorithm with the loading
vector having all its components equal to
1. Then we define the score for each case as
1 if the case has more 1s than 0s, and as 0 otherwise.
The cost for each case is the number
of 0s if the case has more 1s, or the number of
1s otherwise. Are there any loading vectors
with a smaller cost? The answer is yes. The
initial loading vector defined with the help of
the pairwise dependence of the variables can
be different from the one with all its coordinates
equal to 1, and data analysis shows that it
has a lower cost. A data set of 796 patients
was analyzed. The rigidity and strength
of the muscles were measured in different parts
of the body; there were 89 variables. Table 1
shows part of the output of the Boolean factor
analysis program. Using a loading vector all of
whose coordinates equal 1 produces a cost of
17572; if we use the random nature of the data
set, through the dependence structure
of the variables, the initial cost is lowered to
5808.
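
The all-ones initialization described above can be sketched as follows. This is only an illustration under the stated assumption c_p = c_n = 1; the function name and the toy data are hypothetical, and the dependence-based choice of the initial loading vector from [2] is not reproduced here.

```python
def all_ones_initialization(cases):
    """Scores and total cost when the loading vector has every component 1.

    `cases` is a list of cases, each a list of d binary values.
    A case gets score 1 if it has more 1s than 0s, otherwise score 0.
    """
    scores, total_cost = [], 0
    for case in cases:
        ones = sum(case)
        zeros = len(case) - ones
        score = 1 if ones > zeros else 0
        scores.append(score)
        # Score 1 predicts all 1s (each 0 is an error); score 0 predicts all 0s.
        total_cost += zeros if score == 1 else ones
    return scores, total_cost


# Hypothetical toy data: 4 cases, 3 variables each.
toy_cases = [[1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 0, 0]]
toy_scores, toy_cost = all_ones_initialization(toy_cases)
```

Each case contributes the smaller of its number of 0s and its number of 1s, which is why a better-chosen loading vector can start from a lower cost.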

Step Two consists of defining and refining
the scores and loadings. This is the so-called
Boolean regression step, similar to the one used
in the BMDP 8M program. In this step, for a
given (d × k) loading matrix A and for a given
case x, the algorithm chooses the score that
minimizes the cost of misprediction for that
case by examining all 2^k possible scores. Then
the loading matrix A is modified in a similar
fashion for the given scores matrix.
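
The score-selection half of this step can be sketched as an exhaustive search. The code below is a hedged illustration: `predict`, `best_score` and the example loading matrix are hypothetical names and data, and ties are broken here by taking the first minimizer, which the source does not specify.

```python
from itertools import product


def predict(loading, score):
    """Boolean product of a (d x k) loading matrix with a length-k score vector."""
    return [int(any(a and s for a, s in zip(row, score))) for row in loading]


def best_score(loading, case, c_p=1.0, c_n=1.0):
    """Examine all 2^k binary score vectors and return (score, cost) of the cheapest."""
    k = len(loading[0])
    best = None
    for score in product((0, 1), repeat=k):
        pred = predict(loading, score)
        cost = sum(c_p if (x == 0 and b == 1) else c_n if (x == 1 and b == 0) else 0.0
                   for x, b in zip(case, pred))
        if best is None or cost < best[1]:
            best = (score, cost)
    return best


# Hypothetical example: d = 4 variables, k = 2 factors.
loading = [[1, 0], [1, 0], [0, 1], [0, 1]]
case = [1, 1, 0, 1]
score, cost = best_score(loading, case)
```

The search is exponential in k, so it is practical only for a small number of factors such as the 9 used in the example below.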

Example

The data for the example come from a study
of the muscles of 796 subjects with muscle disorders.
The flexibility and strength of different
muscles of the body were measured on a scale
from 0 to 6. A value of 0 means normal muscle
function, a 6 means a completely rigid and
weak muscle, and a value between 1 and 5 means
a different level of flexibility or strength.
Forty-five different muscles were tested. For the
purpose of a Boolean factor analysis the data were
coded as 0 or 1 in the following way: if the value
of the variable was between 1 and 6, indicating
abnormal muscle function, we coded it as 1, and
otherwise as 0. For one muscle the values of the
two variables, flexibility and strength, were the
same (either 0 or 6). In this way we analyzed
89 variables for 796 subjects.
Table 1 shows output of the Boolean factor
analysis of the data set described above. Using
9 factors the cost is only 1554. Because
c_p = c_n = 1, this means that B, the estimator
of the coded data matrix D, differs from
D in 1554 places out of 70844. Thus the prediction
error is only about 2%. Table 2 shows the
nonzero coordinates of the column vectors of
the loading matrix.
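
The figures quoted above can be checked with a few lines of arithmetic. The variable and function names are hypothetical; the numbers are those reported in the text and in Table 1.

```python
# 89 binary variables observed on 796 subjects give the maximal discrepancy.
n_variables, n_subjects = 89, 796
n_cells = n_variables * n_subjects


def prediction_error(discrepancies: int, total: int = n_cells) -> float:
    """Fraction of cells of D that the estimator B gets wrong."""
    return discrepancies / total


error_one_vector = prediction_error(5808)    # one model vector
error_nine_vectors = prediction_error(1554)  # nine model vectors
```

Rounded to five decimals these reproduce the prediction errors printed by the program for 1 and 9 model vectors.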

The example shows that Boolean factor
analysis can be applied successfully not only
to binary data sets. The final prediction error is
very impressive considering that the
new coding itself produces a larger error.
MAXIMAL DISCREPANCY = 70844
MAXIMAL COST        = 70844

## 1 MODELLVECTORS ##
COST = 5808.0000
DISCREPANCY = 5808   PDIS = 5090   NDIS =  718
PREDICTION ERROR = 0.08198   COST ERROR = 0.08198

## 2 MODELLVECTORS ##
COST = 2920.0000
DISCREPANCY = 2920   PDIS = 1930   NDIS =  990
PREDICTION ERROR = 0.04122   COST ERROR = 0.04122

## 3 MODELLVECTORS ##
COST = 2768.0000
DISCREPANCY = 2768   PDIS = 1780   NDIS =  988
PREDICTION ERROR = 0.03907   COST ERROR = 0.03907

## 4 MODELLVECTORS ##
COST = 2708.0000
DISCREPANCY = 2708   PDIS = 1751   NDIS =  957
PREDICTION ERROR = 0.03822   COST ERROR = 0.03822

## 5 MODELLVECTORS ##
COST = 2587.0000
DISCREPANCY = 2587   PDIS = 1664   NDIS =  923
PREDICTION ERROR = 0.03652   COST ERROR = 0.03652

## 6 MODELLVECTORS ##
COST = 2096.0000
DISCREPANCY = 2096   PDIS = 1052   NDIS = 1044
PREDICTION ERROR = 0.02959   COST ERROR = 0.02959

## 7 MODELLVECTORS ##
COST = 1814.0000
DISCREPANCY = 1814   PDIS =  861   NDIS =  953
PREDICTION ERROR = 0.02561   COST ERROR = 0.02561

## 8 MODELLVECTORS ##
COST = 1667.0000
DISCREPANCY = 1667   PDIS =  813   NDIS =  854
PREDICTION ERROR = 0.02353   COST ERROR = 0.02353

## 9 MODELLVECTORS ##
COST = 1554.0000
DISCREPANCY = 1554   PDIS =  783   NDIS =  771
PREDICTION ERROR = 0.02194   COST ERROR = 0.02194

Table 1
Number of Factors : 9

LOADING MATRIX

** 1 ***  16 17 18 19 20 21 22 23 24 25 26 27 28 29 38 39 40 41 42 43 44 45

** 2 ***  46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
          68 69 70 71 72 73 74 76 79 81 82 84

** 3 ***  1 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88

** 4 ***  46 49 51 52 58 59 60 63 65 66 75 77 78 80 83 85 86 87 88 89

** 5 ***  1 89

** 6 ***  30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

** 7 ***  1 2 3 4 6 7 8 9 10 11 12 13 16 17 18 20 21 22 23 24 27 32 35 37
          40 43 45

** 8 ***  6 13 20 27 30 32 38

** 9 ***  5 8 13 14 15 19 22 27 28 29 39

Table 2
Abstract

The location model, which can be used in discriminant
problems when the data contain both categorical and
continuous variables, requires separate continuous-variable
means to be fitted for each possible pattern of categorical
responses. Several forms of similarity measure are reviewed.
The problem of estimating similarity when the continuous
variables of location models have multivariate normal
distributions with equal covariance matrices across the
discrete states has previously been studied. In this work, the
assumption of equal covariance matrices is relaxed. The
explicit form of a general similarity measure between two
location models is derived assuming general multivariate
normal distributions. Estimation of the parameters in this
similarity measure is discussed.


1 Measures of distance and
similarity measures

Consider two populations $\pi_1$ and $\pi_2$ and a vector-valued
continuous random variable X defined over a space R such
that $F_1(x)$ and $F_2(x)$ are the distribution functions of X in
$\pi_1$ and $\pi_2$, while $f_1(x)$ and $f_2(x)$ are the corresponding
density functions with respect to a suitable measure. For a
discrete random variable X, $f_1(x)$ and $f_2(x)$ will be treated as
the corresponding probability mass functions.

The following distance measures or similarity measures have
been extensively studied:

(a) Hellinger distance (1907):

\[ \rho_p(\pi_1,\pi_2) = \left\{ \int_{-\infty}^{\infty} \left| [f_1(x)]^{1/p} - [f_2(x)]^{1/p} \right|^{p} dx \right\}^{1/p} \]

(b) Bhattacharyya distance measure (1946):

\[ \theta(\pi_1,\pi_2) = \cos^{-1} \rho(\pi_1,\pi_2), \quad \text{where } \rho(\pi_1,\pi_2) = \int_{-\infty}^{\infty} [f_1(x) f_2(x)]^{1/2} dx \]

(c) Jeffreys divergence measure (1946):

\[ J(\pi_1,\pi_2) = \int_{-\infty}^{\infty} (f_2(x) - f_1(x)) \log\left[ \frac{f_2(x)}{f_1(x)} \right] dx \]

(d) Kullback & Leibler's information measure (1951):

\[ \Delta_1(\pi_1,\pi_2) = \int_{-\infty}^{\infty} f_1(x) \log\left[ \frac{f_1(x)}{f_2(x)} \right] dx, \qquad
   \Delta_2(\pi_1,\pi_2) = \int_{-\infty}^{\infty} f_2(x) \log\left[ \frac{f_2(x)}{f_1(x)} \right] dx \]

(e) Chernoff measure (1952):

\[ \rho(\pi_1,\pi_2) = \int_{-\infty}^{\infty} [f_1(x)]^{\alpha} [f_2(x)]^{1-\alpha} dx, \quad 0 < \alpha < 1 \]

(f) Matusita's distance (1955):

\[ \| F_1, F_2 \|_2 = \left\{ \int_{-\infty}^{\infty} \left[ \sqrt{f_1(x)} - \sqrt{f_2(x)} \right]^2 dx \right\}^{1/2} \]

This is essentially the same as the Hellinger distance.
If the affinity between $F_1$ and $F_2$ is

\[ \rho(F_1,F_2) = \int_{-\infty}^{\infty} [f_1(x) f_2(x)]^{1/2} dx, \]

then $\| F_1, F_2 \|_2^2 = 2(1 - \rho(F_1,F_2))$.

(g) Morisita's similarity measure (1959):

\[ \lambda = \frac{ 2 \int_{-\infty}^{\infty} f_1(x) f_2(x)\, dx }{ \int_{-\infty}^{\infty} [f_1(x)]^2 dx + \int_{-\infty}^{\infty} [f_2(x)]^2 dx } \]

(h) MacArthur-Levins similarity measure (1967):

\[ \alpha_{ij} = \frac{ \int_{-\infty}^{\infty} f_i(x) f_j(x)\, dx }{ \int_{-\infty}^{\infty} [f_i(x)]^2 dx } \quad \text{for } i,j = 1,2,\ i \neq j \]

(i) Sibson's information radius (1969):

\[ \Delta(\pi_1,\pi_2) = \frac{1}{2} \int_{-\infty}^{\infty} \left\{ f_1(x) \log\left[ \frac{f_1(x)}{f(x)} \right] + f_2(x) \log\left[ \frac{f_2(x)}{f(x)} \right] \right\} dx, \]

where $f(x) = \frac{1}{2}( f_1(x) + f_2(x) )$.

(j) Pianka's measure of overlap (1974):

\[ \alpha = \sqrt{ \alpha_{ij} \alpha_{ji} } = \frac{ \int_{-\infty}^{\infty} f_i(x) f_j(x)\, dx }{ \sqrt{ \int_{-\infty}^{\infty} [f_i(x)]^2 dx \int_{-\infty}^{\infty} [f_j(x)]^2 dx } } \quad \text{for } i,j = 1,2,\ i \neq j \]

(k) Good and Smith (1987) general measures of similarity:

\[ I(r,s) = \int_{-\infty}^{\infty} [f_1(x)]^{r} [f_2(x)]^{s} dx \]

and two alternatives:

\[ J(r,s) = \frac{ 2\, I(r,s) }{ I(2r,0) + I(0,2s) } \quad \text{and} \quad G(r,s) = \frac{ I(r,s) }{ \sqrt{ I(2r,0)\, I(0,2s) } } \]

The parameters r and s are weighting parameters and are
usually given the values 1/2 or 1. Note: Some of the above
measures are special cases of these general similarity
measures. For example, the Bhattacharyya distance measure is
$\theta(\pi_1,\pi_2) = \cos^{-1} I(\frac{1}{2},\frac{1}{2})$; the Chernoff measure is
$\rho(\pi_1,\pi_2) = I(\alpha, 1-\alpha)$; Matusita's affinity measure is
$\rho(F_1,F_2) = I(\frac{1}{2},\frac{1}{2})$; Morisita's similarity measure is $\lambda = J(1,1)$;
the MacArthur-Levins similarity measure is $I(1,1)/I(2,0)$ or
$I(1,1)/I(0,2)$; and Pianka's measure of overlap is $\alpha = G(1,1)$.

The above measures can be applied to populations with
discrete distributions and probability mass functions. In this
case, summation over the possible states is used
instead of integration.
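
For the discrete case, the Good and Smith measures reduce to finite sums, which a short Python sketch can make concrete. The function names and the two probability mass functions are hypothetical; the formulas are the summation analogues of I(r,s), J(r,s) and G(r,s) as defined in this section.

```python
import math


def similarity_I(f1, f2, r, s):
    """Discrete I(r, s): sum over states of f1(x)^r * f2(x)^s."""
    return sum((p ** r) * (q ** s) for p, q in zip(f1, f2))


def similarity_J(f1, f2, r, s):
    """J(r, s) = 2 I(r, s) / (I(2r, 0) + I(0, 2s))."""
    return 2.0 * similarity_I(f1, f2, r, s) / (
        similarity_I(f1, f2, 2 * r, 0) + similarity_I(f1, f2, 0, 2 * s))


def similarity_G(f1, f2, r, s):
    """G(r, s) = I(r, s) / sqrt(I(2r, 0) * I(0, 2s))."""
    return similarity_I(f1, f2, r, s) / math.sqrt(
        similarity_I(f1, f2, 2 * r, 0) * similarity_I(f1, f2, 0, 2 * s))


# Hypothetical probability mass functions over three states.
f1 = [0.5, 0.5, 0.0]
f2 = [0.0, 0.5, 0.5]

affinity = similarity_I(f1, f2, 0.5, 0.5)         # Matusita's affinity
morisita = similarity_J(f1, f2, 1, 1)             # Morisita's lambda
pianka = similarity_G(f1, f2, 1, 1)               # Pianka's overlap
matusita = math.sqrt(sum((math.sqrt(p) - math.sqrt(q)) ** 2
                         for p, q in zip(f1, f2)))  # Matusita's distance
```

For these two mass functions the squared Matusita distance equals 2(1 - affinity), in agreement with the identity stated under (f).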

2 Location Models

Suppose that p continuous or quantitative variables
$Y = (Y_1, \ldots, Y_p)^T$ and q discrete or qualitative variables
$X = (X_1, \ldots, X_q)^T$ are measured on each individual, and that
individuals are drawn from 2 populations $\pi_1$, $\pi_2$. The
location model was introduced by Olkin and Tate (1961) to
cope with the mixed variables, and this model has
subsequently been applied to the two-sample case for tests of
hypotheses, for discriminant analysis, for clustering, for
classification, and for medical diagnostics.

The q discrete variables (which may be binary or categorical) are
assumed to define a multinomial vector Z containing d
possible states; e.g., for b binary variables and k three-state
categorical variables, $d = 2^b 3^k$.

Thus each distinct pattern of X defines a multinomial cell
uniquely. $X^T = (X_1, \ldots, X_q)$ can be replaced by a random
vector $Z^T = (Z_1, \ldots, Z_d)$, and each $Z_j$ takes the value one
for a particular state of the original X's and zero elsewhere.
The probability of observing state m in population $\pi_i$ is
assumed to be $p_{im}$ (i=1,2; m=1,...,d). Then, conditionally on
Z falling in state m, the p continuous variables Y are
assumed to follow a multivariate normal distribution with
mean $\mu_i^{(m)}$ and dispersion matrix $\Sigma_i^{(m)}$ in population $\pi_i$
(i=1,2; m=1,...,d). The only assumption embodied in this
model is normality, and this is imposed in most parametric
techniques.

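The location model can be defined as

\[ f(z, p_i \mid \pi_i) = \prod_{m=1}^{d} p_{im}^{\,z_m}, \]

where

\[ p_i = (p_{i1}, p_{i2}, \ldots, p_{id}), \qquad p_{im} = E(z_m \mid \pi_i), \qquad \sum_{m=1}^{d} p_{im} = 1, \]

and

\[ f_i^{(m)}(y) = f(y \mid \pi_i,\ z_m = 1,\ z_k = 0,\ k \neq m,\ k = 1, 2, \ldots, d). \]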

The proposed model admits the following special cases of
interest:

L1: the conditional dispersion matrix is constant for all states
in each population, that is, $\Sigma_i^{(m)} = \Sigma_i$ (i=1,2; m=1,...,d):
homogeneous variance-covariance matrices across states
within population.

L2: the conditional dispersion matrix is the same in both
populations for each state, that is, $\Sigma_i^{(m)} = \Sigma^{(m)}$ (i=1,2; m=1,...,d):
homogeneous variance-covariance matrices between
populations with respect to states.

L3: the conditional dispersion matrix is constant for all states
and both populations, that is, $\Sigma_i^{(m)} = \Sigma$ (i=1,2; m=1,...,d):
homogeneous variance-covariance matrices across states and
populations.
3 General similarity measures
with mixed variables:
Population case

Let us derive the general similarity measures under the most
relaxed conditions between two multivariate normal
populations - different means and different dispersion
matrices.

Lu, Smith and Good (1989) derived

\[ I(r,s) = \frac{ \exp\left[ -\frac{rs}{2} (\mu_1-\mu_2)^T (s\Sigma_1 + r\Sigma_2)^{-1} (\mu_1-\mu_2) \right] }
                { (2\pi)^{p(r+s-1)/2}\, |\Sigma_1|^{(r-1)/2}\, |\Sigma_2|^{(s-1)/2}\, |s\Sigma_1 + r\Sigma_2|^{1/2} }, \]

\[ I(2r,0) = |2\pi\Sigma_1|^{(1-2r)/2} (2r)^{-p/2} \quad \text{and} \quad I(0,2s) = |2\pi\Sigma_2|^{(1-2s)/2} (2s)^{-p/2}, \]

\[ J(r,s) = \frac{ 2 \exp\left[ -\frac{rs}{2} (\mu_1-\mu_2)^T (s\Sigma_1 + r\Sigma_2)^{-1} (\mu_1-\mu_2) \right] }
                { (2\pi)^{p(r+s-1)/2}\, |\Sigma_1|^{(r-1)/2}\, |\Sigma_2|^{(s-1)/2}\, |s\Sigma_1 + r\Sigma_2|^{1/2}\, \left[ I(2r,0) + I(0,2s) \right] }, \]

\[ G(r,s) = \frac{ \exp\left[ -\frac{rs}{2} (\mu_1-\mu_2)^T (s\Sigma_1 + r\Sigma_2)^{-1} (\mu_1-\mu_2) \right] }
                { (4rs)^{-p/4}\, |\Sigma_1 \Sigma_2|^{-1/4}\, |s\Sigma_1 + r\Sigma_2|^{1/2} }, \]

\[ G(r,r) = \frac{ \exp\left[ -\frac{r}{2} (\mu_1-\mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} (\mu_1-\mu_2) \right] }
                { |\Sigma_1 \Sigma_2|^{-1/4}\, \left| \frac{1}{2}(\Sigma_1 + \Sigma_2) \right|^{1/2} }. \]

In particular,

\[ \rho = I(\tfrac{1}{2},\tfrac{1}{2}) = \frac{ |\Sigma_1|^{1/4} |\Sigma_2|^{1/4} }{ \left| \frac{1}{2}(\Sigma_1 + \Sigma_2) \right|^{1/2} }
   \exp\left[ -\tfrac{1}{4} (\mu_1-\mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} (\mu_1-\mu_2) \right]. \]

For r = s = 1/2, and when $\Sigma_1 = \Sigma_2 = \Sigma$,

\[ \rho = I(\tfrac{1}{2},\tfrac{1}{2}) = \exp\left[ -\tfrac{1}{8} (\mu_1-\mu_2)^T \Sigma^{-1} (\mu_1-\mu_2) \right] \]

and

\[ \lambda = J(1,1) = \exp\left[ -\tfrac{1}{4} (\mu_1-\mu_2)^T \Sigma^{-1} (\mu_1-\mu_2) \right] = \alpha_{12} = \alpha_{21} = \alpha = G(1,1). \]

These are the exponential forms of certain functions of the
Mahalanobis generalized distance $(\mu_1-\mu_2)^T \Sigma^{-1} (\mu_1-\mu_2)$.

Krzanowski (1983) derived the following for $\Sigma_1 \neq \Sigma_2$:

\[ \rho = 2^{p/2}\, |\Sigma_1|^{1/4}\, |\Sigma_2|^{-1/4}\, |I + \Sigma_1 \Sigma_2^{-1}|^{-1/2}
   \exp\left[ -\tfrac{1}{4} \sum_{j=1}^{p} \frac{ (\nu_{1j} - \nu_{2j})^2 }{ 1 + \lambda_j } \right]. \]

Here $\lambda_j$ are the eigenvalues of $\Sigma_2^{-1} \Sigma_1$ and $\nu_{1j}$, $\nu_{2j}$
(j=1,...,p) are the coordinates of the population means in the
transformed space.

Now consider the mixed variable case:

The joint density of state m of the discrete variables and
values $y_1, \ldots, y_p$ for the continuous variables is given by
the product of the conditional and marginal densities as
$p_{im} f_i^{(m)}(y)$, so that

\[ I_L(r,s) = \sum_{m=1}^{d} \int_{-\infty}^{\infty} [p_{1m} f_1^{(m)}(y)]^{r} [p_{2m} f_2^{(m)}(y)]^{s} dy
            = \sum_{m=1}^{d} p_{1m}^{r}\, p_{2m}^{s} \int_{-\infty}^{\infty} [f_1^{(m)}(y)]^{r} [f_2^{(m)}(y)]^{s} dy
            = \sum_{m=1}^{d} p_{1m}^{r}\, p_{2m}^{s}\, I_m(r,s), \]

where $I_m(r,s)$ is the general measure of similarity I(r,s)
between $N(\mu_1^{(m)}, \Sigma_1^{(m)})$ and $N(\mu_2^{(m)}, \Sigma_2^{(m)})$ and can
be evaluated as above. Krzanowski (1983) discussed the
case with r = s = 1/2 for k populations.

Moreover, two alternatives are

\[ J_L(r,s) = \frac{ 2\, I_L(r,s) }{ I_L(2r,0) + I_L(0,2s) } \quad \text{and} \quad
   G_L(r,s) = \frac{ I_L(r,s) }{ \sqrt{ I_L(2r,0)\, I_L(0,2s) } }. \]

Most of the measures discussed in section 1 will be special
cases of the above general similarity measures when they are
applied to the location models.


4 General similarity measures
with mixed variables:
Sample case

In practice, the general similarity measures between two
groups of sample data will be evaluated. First, we can adopt
the procedures of Daudin (1986) or Krusinska (1989) to
select the variables that will make up the location
models; Daudin's procedure is based on Akaike's criterion,
whereas Krusinska's procedure is based on a multivariate
discriminatory measure similar to the distance measure. To
obtain the sample estimates of the general similarity measures,
the simplest way is to treat the data in either group as a
sample from the corresponding population $\pi_i$ and to replace
all parameter values by their sample estimates (maximum
likelihood estimates in terms of the conditional likelihood on
each state); this may be called PLUGALL. As for the
parameters r and s in the general similarity measures, the
researcher will select the appropriate values (1/2 or 1) to
match the form of similarity measure needed in the application.
For sufficiently large samples, the sample estimates of the
general similarity measures can be obtained for models L1,
L2 and L3. However, it would generally make sense to pool
across those categories which have relatively few
observations; that is, L3 is the model that may commonly be
encountered. The remaining task is to evaluate the statistical
properties of the estimators of the various general similarity
measures. The mathematical difficulties in deriving the
properties of the estimators are formidable, and consequently
we will evaluate the properties by resampling methods -
jackknife and bootstrap. These results will be discussed
elsewhere.
Abstract

In theory, it is difficult to define a hash function which is
capable of creating random data from nonrandom data. This
paper addresses the randomization properties of an
extremely fast, compact hash function. The MD4 message
digest algorithm produces a 128-bit output or "message
digest" from an arbitrarily-long input string of bits. The
results of a variety of empirical tests which were conducted
to detect possible statistical defects in the algorithm are
presented.

This  paper  presents  the  results  of  a  statistical  analysis  of 
the  randomization  properties  of  the  MD4  Algorithm  [7]. 
The  MD4  message  digest  algorithm  is  a  fast,  compact  hash 
function  which  maps  an  arbitrarily-long  string  of  bits  onto 
a  128-bit  quantity.  For  a  complete  description  of  the 
algorithm,  the  reader  is  referred  to  Rivest  [7].  The 
investigation  of  MD4  consisted  of  a  series  of  six  empirical 
tests  in  which  a  large  number  of  128-bit  outputs  was 
generated  and  then  examined  for  randomness,  or  the  lack 
thereof.  The  results  of  these  tests  are  as  follows. 

The first test conducted was a byte parity test. The
hypotheses for this chi-square test are
as follows:

H0: Odd and even parity of bytes are equally likely.

H1: Odd and even parity of bytes are not equally likely.


*  MD4  is  the  product  of  Ron  Rivest,  MIT  Laboratory  for 
Computer  Science,  1990. 


From a total of one million iterated applications of MD4,
each of the 16 byte positions of the one million outputs was
examined for parity. The results of this test are shown in
Table 1. It is apparent that the null hypothesis cannot be
rejected, indicating that each byte is equally likely to be odd
or even. In fact, the extremely high P-value (.9813) might
lend statistical credence to the algorithm's "purposeful
smashing of bytes."
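
The chi-square contributions in Table 1 can be recomputed from the tabulated counts. The sketch below assumes, as the tabulated values suggest, that each contribution is (actual - expected)^2 / expected for one parity class per byte position; the variable and function names are hypothetical, and the P-value is not recomputed here.

```python
# "Actual" counts for byte positions 1..16, copied from Table 1.
actual_counts = [
    500979, 499412, 499985, 500510, 500513, 499849, 499780, 499808,
    500624, 499776, 499787, 499922, 500242, 499414, 499939, 500347,
]
expected_count = 500000  # half of the one million outputs


def chi_square_contributions(actual, expected):
    """Per-position contribution (O - E)^2 / E."""
    return [(o - expected) ** 2 / expected for o in actual]


contributions = chi_square_contributions(actual_counts, expected_count)
chi_square_total = sum(contributions)
```

Summing the sixteen contributions reproduces the total reported in the table.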


Byte                                Chi-Square
Position    Actual     Expected    Contribution
    1       500979      500000       1.916882
    2       499412      500000        .691488
    3       499985      500000        .000450
    4       500510      500000        .520200
    5       500513      500000        .526338
    6       499849      500000        .045602
    7       499780      500000        .096800
    8       499808      500000        .073728
    9       500624      500000        .778752
   10       499776      500000        .100352
   11       499787      500000        .090738
   12       499922      500000        .012168
   13       500242      500000        .117128
   14       499414      500000        .686792
   15       499939      500000        .007442
   16       500347      500000        .240818

                        Total        5.905678
                        P-value       .9813

Table 1. Byte Parity Test
A second test conducted is a check for uniformity in the
bivariate distribution of byte position versus byte value.
The hypotheses tested are as follows:

H0: The bivariate distribution of byte position vs byte value
is uniform.

H1: The bivariate distribution is not uniform.

Three million iterated applications of MD4 were performed,
and the results of examining the decimal integer value of
each byte in each of the three million outputs are shown in
Table 2.


                      Byte Position
Byte Value        1         2      ...      16
     0          11645     11591    ...    11678
     1          11722     11658    ...    11780
    ...
    255         11832     11823    ...    11483

Bivariate Frequency Distribution

E(X) = 11718.75        X^2_0 = 3895.87
Min  = 11354           df    = 4095
Max  = 12099           P     = .54

Table 2. Uniformity of Byte Position vs Byte Value

The  results  indicate  that  the  distribution  is  indeed  uniform. 
One  can  also  conclude  independence  between  position  and 
value.  That  is,  given  a  particular  byte  position,  the  byte 
value  is  equally  likely  to  be  any  of  the  256  possible  values. 
Similarly,  given  a  particular  value,  it  is  equally  likely  to 
occur  in  any  of  the  16  byte  positions. 

A third frequency test was then conducted, this time at
the bit level. The hypotheses for this test are expressed as
follows:

H0: The distribution of 1's across all 128 bit positions
is uniform.

H1: This distribution is not uniform.

Another three million outputs from MD4 were generated
and each of the bit positions examined to determine the
frequency of 1's in each position. Table 3 provides the
results of this test. These results again indicate uniformity
across the 128 bit positions, i.e., each bit is equally likely
to be a 0 or 1.


Bit                                     Chi-Square
Position     Actual       Expected     Contribution
    1       1,499,496     1,500,000       .169
    2       1,500,769     1,500,000       .394
    3       1,501,119     1,500,000       .835
    4       1,500,256     1,500,000       .037
    5       1,500,256     1,500,000       .044
   ...
  126       1,499,975     1,500,000       .000
  127       1,499,466     1,500,000       .190
  128       1,500,624     1,500,000       .260

Min = 1,498,191        X^2_0 = 70.657
Max = 1,502,403        df    = 127
                       P     = .85

Table 3. Frequency Test for Bit Positions


The fourth test, a gap test, examined another set of one
million outputs from MD4. Each output was scanned for
the number of 0's between successive 1's. For example, the
string "10010110001" has gaps of 2, 1, 0, and 3,
respectively. Table 4 shows the total number of observed
gaps for gaps of size 24 or less. No gaps of size 25 or
larger were encountered. The successive halving of the
number of observed gaps for an incremental gap size of 1 is
what we would expect to see if the probability of a 1 or 0
in each bit position is .5.
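
The gap computation can be sketched in a few lines of Python. The function names are hypothetical, and outputs are assumed to be given as strings of "0" and "1" characters; this is an illustration of the counting rule in the example above, not the program used for the test.

```python
from collections import Counter


def gaps(bits: str):
    """Numbers of 0's between successive 1's in a bit string."""
    ones = [i for i, b in enumerate(bits) if b == "1"]
    return [right - left - 1 for left, right in zip(ones, ones[1:])]


def gap_histogram(outputs):
    """Total number of observed gaps of each size over many outputs."""
    histogram = Counter()
    for bits in outputs:
        histogram.update(gaps(bits))
    return histogram


example_gaps = gaps("10010110001")
example_histogram = gap_histogram(["10010110001", "1011"])
```

If each bit is independently 0 or 1 with probability .5, a gap of size g + 1 is half as likely as a gap of size g, which is the halving pattern visible in Table 4.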


Gap Size    Observed #      Gap Size    Observed #
    0       32,263,081         13          3,583
    1       16,247,658         14          1,793
    2        8,063,542         15            881
    3        4,001,042         16            421
    4        1,984,905         17            209
    5          983,004         18            121
    6          488,068         19             59
    7          241,362         20             27
    8          120,025         21             18
    9           59,219         22              5
   10           29,843         23              3
   11           14,471         24              2
   12            7,337

Table 4. Gap Test
The fifth test conducted was one in which the difference
(in absolute value) between the number of 1's and the
number of 0's occurring in each of one million outputs was
noted. The observed (actual) frequencies, as well as
expected frequencies (under the assumption that a 0 or 1 in
any position is equally likely), are shown in Table 5. These
results clearly support the assumption.
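
Under the stated assumption, the number of 1's in a 128-bit output is binomial with n = 128 and p = .5, so the expected frequencies follow from binomial coefficients. The sketch below is a hedged illustration with a hypothetical function name; it reproduces the leading entries of the Expected column in Table 5 and makes no claim about how the smallest tail entries of the table were computed.

```python
from math import comb


def expected_difference_counts(n_bits: int = 128, n_outputs: int = 1_000_000):
    """Expected number of outputs for each value of |#1's - #0's|."""
    expected = {}
    for ones in range(n_bits + 1):
        difference = abs(2 * ones - n_bits)  # |ones - zeros|
        probability = comb(n_bits, ones) / 2 ** n_bits
        expected[difference] = expected.get(difference, 0.0) + n_outputs * probability
    return expected


expected = expected_difference_counts()
```

Because 128 is even, only even differences occur, which is why Table 5 lists differences 0, 2, 4, and so on.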


Difference     Actual     Expected    Chi-Square
     0          70331       70386        .043
     2         138539      138606        .032
     4         132566      132306        .511
     6         122122      122433        .790
     8         109900      109829        .046
    10          94974       95504       2.941
    12          80948       80496       2.538
    14          65734       65757        .008
    16          52053       52058        .000
    18          39905       39935        .023
    20          29959       29681       2.604
    22          21426       21370        .147
    24          14928       14903        .042
    26          10117       10064        .219
    28           6508        6581        .810
    30           4205        4165        .384
    32           2532        2551        .142
    34           1480        1512        .677
    36            870         866        .018
    38            490         480        .208
    40            277         257       1.556
    42            115         133       2.436
    44             67          67        .000
    46             32          32        .000
    48             12          15        .600
    50              4           7       1.286
    52              5       5.833        .119
    54              0       2.436       2.436
    56              1        .980        .000

X^2_0 = 18.603
df    = 27
P     = .884

Table 5. Differences Between # of 1's and # of 0's


A final test was conducted to examine the avalanche
effect of MD4. A series of 30,000 comparisons was made,
where each comparison compared two outputs of MD4.
The two outputs compared were the outputs corresponding
to two "almost identical" inputs to MD4. That is, if A is a
32,000-byte (256,000-bit) string and A' is also a 32,000-byte
string which differs from A in only 1 bit position, we
then compared MD4(A) and MD4(A'), each being a 128-bit
string. We looked at the Hamming distance between
MD4(A) and MD4(A'), i.e., the number of bit positions
in which MD4(A) and MD4(A') differ. The
hypotheses tested can be described as follows:

H0: The distribution of Hamming distances is binomial
with n = 128 and p = .5.

H1: This distribution is not binomial with n = 128 and
p = .5.

The frequency distribution of Hamming distances that
occurred among the 30,000 comparisons is shown in the
histogram of Figure 1.
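
One comparison of this kind can be sketched as follows. The helper names are hypothetical. Because MD4 is not guaranteed to be available in a given Python build (`hashlib.new("md4")` may fail, depending on the underlying OpenSSL), the example passes the digest function as a parameter and uses MD5, another 128-bit digest, purely as a stand-in; it does not reproduce the MD4 results reported here.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def flip_bit(data: bytes, position: int) -> bytes:
    """Copy of `data` with one bit flipped (position 0 is the first bit)."""
    flipped = bytearray(data)
    flipped[position // 8] ^= 0x80 >> (position % 8)
    return bytes(flipped)


def avalanche_distance(digest_fn, message: bytes, position: int) -> int:
    """Hamming distance between the digests of a message and a 1-bit variant."""
    return hamming_distance(digest_fn(message), digest_fn(flip_bit(message, position)))


# Stand-in digest (MD5, 128 bits); swap in an MD4 implementation if one is available.
def md5_digest(message: bytes) -> bytes:
    return hashlib.md5(message).digest()


message = bytes(32000)  # a 32,000-byte string, here all zero bytes
distance = avalanche_distance(md5_digest, message, position=12345)
```

Repeating this for many messages and bit positions and tabulating the distances gives the kind of histogram described for Figure 1.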


Figure  1.  Avalanche  Effect:  Hamming  Distance 

This figure suggests that, on the average, about half (64) of
the bits will change. While this figure gives us an
indication of how many bits will change, Table 6 shows us
that, of the bits that do change, each of the bit positions
tends to contribute equally to the number of changes.
Clearly, the avalanche effect demonstrated is one in which
the outputs for two "almost identical" inputs appear to be as
random as any other two randomly chosen 128-bit strings.

The results shown here indicate that MD4 is a byte
smasher extraordinaire. These random properties of MD4,
together with its speed and compactness, make it a
potentially valuable tool for a variety of applications,
including virus detection and compressing large files prior
to signing them with a public-key algorithm such as RSA.


Bit Position    Actual    Expected    Chi-Square
      1         14,995     15,000        .002
      2         14,997     15,000        .001
      3         15,123     15,000       1.009
      4         14,956     15,000        .129
      5         14,915     15,000        .482
     ...
    126         14,855     15,000       1.402
    127         15,029     15,000        .056
    128         15,092     15,000        .564

Min = 14,811        X^2_0 = 65.148
Max = 15,294        df    = 127
                    P     = .89

Table 6. Avalanche Effect: How often each of the 128 bit
positions changed in the 30,000 comparisons
Abstract¹

We simulate several variants of a class of queueing networks
- corresponding to different system parameter values
or operating policies - simultaneously. One clock
mechanism is used to drive all the variants. This clock
synchronizes the system trajectories such that the “same
event” takes place at the “same time” at all systems.
This synchronization is the basis of the massively parallel
algorithms we develop. Implementation of the algorithms
on the massively parallel Connection Machine
and the implications of the approach for performance
optimization are discussed.

1  Introduction 

There is an inherent partial parallelism in networks of
queues. Often each server operates as an independent
entity as long as customers are present to be served. The
effect of other servers is experienced through idle periods
- where no customer is present - or blocked periods
- where no space is available for a served customer (to
illustrate, we are considering a simple scenario). While
the status of the servers (busy, idle, blocked) remains
unchanged, they can be simulated independently and
in parallel. Most parallel algorithms for queueing simulation
use this partial parallelism for simulating one
“large” network (see [4]).

In  contrast,  we  consider  the  simulation  of  a  “large” 
number  of  variants  of  a  “nominal”  network  that  differ, 
for  example,  in  their  routing  schemes,  bvffer  configura¬ 
tions,  service  or  arrival  rates,  or  the  number  of  customers 
in  the  system.  Obviously  there  is  a  total  parallelism 
among  the  variants.  More  importantly,  we  simulate  each 
variant  as  a  network  of  autonomous  servers  (i.e.  servers 

The work in this paper was partially supported by the National
Science Foundation under Grant DDM-8914277.


that  determine  the  “departure  times”  of  customers  inde¬ 
pendently  of  the  presence  or  absence  of  customers);  the 
same  autonomous  servers  can  be  shared  between  all  the 
variants.  We  call  this  approach  the  Standard  Clock  (SC) 
technique  [3,  6,  7]  since  a  single  simulation  clock  mecha¬ 
nism  (that  may  be  standardized)  is  defined  which  drives 
all  the  variants  simultaneously.  This  clock  synchronizes 
the  system  trajectories  such  that  the  “same  event”  takes 
place  at  the  “same  time”  at  all  systems.  The  obtained 
synchronization  is  the  basis  of  the  algorithms  we  develop 
for the implementation on the massively parallel Connection
Machine (CM). This approach is applicable to
queueing  networks  that  can  be  modeled  as  Generalized 
Semi-Markov  Processes  (GSMP)  with  bounded  hazard 
rate  event  life  times.  For  networks  that  can  be  modeled 
as  continuous  time  uniformizable  Markov  chains,  SC  is 
based  on  the  well  known  uniformization  procedure. 

An  important  feature  of  this  approach  is  the  concur¬ 
rent  evaluation  of  the  performance  of  the  network  at  very 
large  numbers  of  parameter  values  or  operating  policies. 
We  believe  this  feature  opens  up  new  possibilities  for 
performance  modeling  and  optimization.  As  a  first  step 
we  consider  a  global  random  search  for  performance  op¬ 
timization  of  a  queueing  network. 

Section  2  defines  our  model  of  a  single  queueing  net¬ 
work;  a  parameterization  of  the  model  is  considered  in 
section 3; the Standard Clock algorithm and its massively
parallel implementation are given in section 4, and
in  section  5  we  consider  solving  a  stochastic  optimization 
problem  via  massively  parallel  simulation. 

2  Model  :  systems  driven  by  marked  Poisson 
processes 

Let $(\tau, e) = \{(\tau_n, e_n);\ n \ge 0\}$ be a marked Poisson
process, where $\{\tau_n;\ n \ge 0\}$ is the sequence of arrival
instances of a Poisson process $N$, and $\{e_n;\ n \ge 0\}$ is an
I.I.D. sequence of discrete random variables, independent of the
Poisson process $N$, such that $e_n \in E$, where $E$ is a finite
set called the set of events.

Let $S$, a denumerable set, be the set of “physical”
states of the system. If upon the occurrence of an event
$e \in E$, the state of the system is $x \in S$, then the next
state of the system $x' \in S$ is determined via a given state
transition rule:

$x' = f(x, e, W)$  (1)

$W$ is a random variable used to model probabilistic
transitions.

Let $X(0)$ be the random variable of the initial state.
The sequence of states $\{X(n);\ n \ge 1\}$ is defined recursively
by $X(n) = f(X(n-1), e_n, W_n)$ and the process
$X = \{X(t);\ t \ge 0\}$ is defined as follows:

$X(t) = \sum_{n=0}^{\infty} X(n)\, I\{\tau_n \le t < \tau_{n+1}\}$ for $t \ge 0$  (2)

This  model  is  quite  versatile:  open  and  closed  networks 
of  queues  with  multiple  classes  of  customers,  Markovian 
routing, finite and infinite buffer spaces, and a variety of
service disciplines can be modeled as such. Networks
with exponential service times and inter-arrival times
provide the most straightforward examples but networks
with  phase-type  service  times  and  inter-arrival  times  can 
be  modeled  as  well  by  considering  a  more  intricate  state 
space. 

To  illustrate  we  consider  a  simple  example: 

Example 2.1: Consider a tandem network of $K$ exponential
servers with rates $\mu_1, \ldots, \mu_K$ respectively. There
are $B_i$ buffers between server $i$ and server $(i+1)$
($i = 1, \ldots, K-1$). We assume that there are no spaces at
the servers. There is an infinite supply of parts at server
1 and infinite space for finished parts after server $K$.
Server $i$ begins processing a part only if the immediate
downstream buffer, i.e. $B_i$, is not full (the so-called
communication blocking). In this case:

$S = \{x = (x_1, \ldots, x_{K-1});\ 0 \le x_i \le B_i\}$

$E = \{d_1, \ldots, d_K\}$ ($d_i$ = departure from server $i$)

$\tau$ = arrival instances of a Poisson process with rate
$\Lambda = \mu_1 + \cdots + \mu_K$.

$\mathrm{Prob}(e_n = d_i) = \mu_i / \Lambda$

Let $m_i$ be a $(K-1)$ dimensional vector with $i$th entry
equal to $-1$, $(i+1)$th entry equal to 1, and all other
entries equal to 0 ($1 \le i < K-1$). Let $m_{K-1}$ be a vector
with $(K-1)$th entry equal to $-1$ and all other entries
equal to 0, then

$$f(x, d_i, W) = \begin{cases} x + m_i & \text{if } x_i > 0,\ x_{i+1} < B_{i+1} \\ x & \text{otherwise} \end{cases}$$

(for $d_1$ only $x_1 < B_1$, and for $d_K$ only $x_{K-1} > 0$ is
required.)
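The transition rule of example 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tandem_transition` and the 0-based indexing are ours, and the sketch reads buffer $i$ as sitting between server $i$ and server $i+1$, so that the two boundary cases in the note above (only room is needed for the first server, only a waiting part for the last) come out as stated. The random variable $W$ is omitted because routing in a tandem line is deterministic.

```python
def tandem_transition(x, k, B):
    """Next state after event d_k (departure attempt at server k).

    x : tuple, x[i] = parts waiting in the buffer after server i (0-based)
    k : 0-based index of the server whose event occurred (0 .. K-1)
    B : tuple of buffer capacities, same length as x (K-1 entries)
    """
    K = len(B) + 1
    has_input = (k == 0) or (x[k - 1] > 0)     # first server never starves
    has_room = (k == K - 1) or (x[k] < B[k])   # last server never blocks
    if not (has_input and has_room):
        return x                               # the event leaves the state unchanged
    nxt = list(x)
    if k > 0:
        nxt[k - 1] -= 1                        # take a part from the upstream buffer
    if k < K - 1:
        nxt[k] += 1                            # deliver it to the downstream buffer
    return tuple(nxt)


# Three servers, two buffers with capacities 1 and 2.
state = (0, 0)
for event in (0, 0, 1, 2):
    state = tandem_transition(state, event, (1, 2))
```

Events that cannot take effect (a starved or blocked server) simply return the state unchanged, which is what allows the same event stream to drive every variant.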


3  A  parametric  family  of  systems  driven  by 
the  same  marked  Poisson  process 

To  consider  several  variants  of  a  “nominal  network”  we 
parameterize  the  system  with  respect  to  a  parameter  of 
interest.  The  parameterization  may  be  with  respect  to 
the  number  of  buffers,  buffer  configurations,  routing  pro¬ 
portions,  number  of  customers,  control  policies,  service 
and  inter-arrival  rates  or  any  combinations  of  the  above. 
The parameterization of a model of the system can be
accomplished through the state transition function $f$ while
leaving the marked Poisson process $(\tau, e)$ unchanged.

Example 3.1: Consider the parameterization of example 2.1
through buffer configurations such that
$\sum_{i=1}^{K-1} B_i = C$.

Let $B = \{(B_1, \ldots, B_{K-1});\ B_i \ge 1,\ \sum_{i=1}^{K-1} B_i = C\}$.
For each $b \in B$ let $S_b$ be the state space corresponding to
configuration $b$, and let $E$, $\tau$, and $e$ be as defined in
example 2.1. The state transition rules for each configuration
$b$ are defined by

$$f^b(x^b, d_i, W) = \begin{cases} x^b + m_i & \text{if } x^b_i > 0,\ x^b_{i+1} < B_{i+1} \\ x^b & \text{otherwise} \end{cases}$$

(for $d_1$ only $x^b_1 < B_1$, and for $d_K$ only $x^b_{K-1} > 0$ is
required.)

Note that the same $(\tau, e)$ (model of the simulation clock
mechanism) is used for all $b \in B$. The next section
describes  algorithms  for  using  one  clock  mechanism  to 
drive  many  systems  simultaneously. 
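To make the parameter set of example 3.1 concrete, the following Python sketch draws buffer configurations with every $B_i \ge 1$ and a fixed total $C$. The helper name `random_configuration` and the choice of a uniform distribution over all such allocations are our assumptions; the text does not prescribe how configurations are sampled.

```python
import random


def random_configuration(num_buffers, total, rng):
    """Draw one allocation (B_1, ..., B_{K-1}) with B_i >= 1 and sum == total."""
    # Stars and bars: pick num_buffers - 1 distinct cut points in 1..total-1;
    # the gaps between consecutive cut points are the buffer sizes.
    cuts = sorted(rng.sample(range(1, total), num_buffers - 1))
    edges = [0] + cuts + [total]
    return tuple(edges[i + 1] - edges[i] for i in range(num_buffers))


rng = random.Random(0)
configs = [random_configuration(10, 20, rng) for _ in range(5)]
```

Each distinct set of cut points corresponds to exactly one allocation, so choosing the cut points uniformly gives a uniform draw over the whole parameter set.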

4  Standard Clock Algorithm and Massively
Parallel  Implementation 

Assume  M  variants  of  a  “nominal  network”  correspond¬ 
ing  to  M  distinct  parameter  values  or  operating  poli¬ 
cies  are  given.  Assume  further  that  the  nominal  system 
can  be  modeled  by  the  model  described  in  section  2  and 
that  the  variants  are  parameterized  through  the  state 
transition rules $f$ as described in section 3. Let $(\tau, e)$ be
the common marked Poisson process and $f^1, \ldots, f^M$ the
state transition rules associated with variants $1, \ldots, M$,
respectively. The simulation algorithm consists of two
parts: algorithm A that simulates the clock mechanism
(generates samples of the marked Poisson process $(\tau, e)$) and
algorithm B that describes the simultaneous updating of
the system states upon occurrence of events.

Let $E = \{e_1, \ldots, e_K\}$ be the set of events. We use
the Alias method to generate samples of $e_n$. To use this
method it is necessary to initially generate two $K$
dimensional vectors $R$ and $A$. We refer the reader to [2]
for the algorithm to generate these vectors and assume
here that $R$ and $A$ are generated. Then:

Algorithm A: determining $\tau_{n+1}$ and $e_{n+1}$

1. generate $t_{n+1}$, a sample of an exponential r.v. with
   rate 1.
   Set $\tau_{n+1} = \tau_n + t_{n+1}/\Lambda$.

2. generate $u_{n+1}$, a sample of a uniform(0, 1) r.v. $U_{n+1}$;
   let $i = [K u_{n+1}] + 1$ ($[x]$ denotes the integer part of $x$).

3. generate $v_{n+1}$, a sample of a uniform(0, 1) r.v. $V_{n+1}$;
   if $v_{n+1} < R[i]$ then $e_{n+1} = e_i$
   else $e_{n+1} = A[i]$.

This  clock  mechanism  is  simple  and  very  efficient.  In 
fact,  except  for  the  generation  of  vectors  R  and  A,  that 
can be accomplished in $O(K)$ and is performed only once
at  the  beginning  of  the  simulation,  the  execution  of  the 
clock  mechanism  is  essentially  independent  of  K,  the 
number  of  events  in  the  system. 
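A minimal Python sketch of the clock mechanism is given below. The table construction shown is one common variant of the alias method (often attributed to Vose) and is our stand-in for the construction in [2]; the function names `build_alias` and `clock_tick` and the 0-based event indices are likewise our own.

```python
import random


def build_alias(probs):
    """Build the cutoff vector R and the alias vector A in O(K) time."""
    K = len(probs)
    scaled = [p * K for p in probs]
    R = [1.0] * K                 # cutoff probabilities
    A = list(range(K))            # alias events
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        R[s], A[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]          # l donates the missing mass to column s
        (small if scaled[l] < 1.0 else large).append(l)
    return R, A                   # any leftover column keeps cutoff 1.0


def clock_tick(tau, total_rate, R, A, rng):
    """One step of Algorithm A: return (next event time, event index)."""
    tau_next = tau + rng.expovariate(1.0) / total_rate
    i = min(int(len(R) * rng.random()), len(R) - 1)   # uniform column
    event = i if rng.random() < R[i] else A[i]
    return tau_next, event


rates = [1.0, 2.0, 3.0]
R, A = build_alias([r / sum(rates) for r in rates])
tau, event = clock_tick(0.0, sum(rates), R, A, random.Random(1))
```

Each tick costs two uniform draws and one exponential draw regardless of the number of events, which is the property the paragraph above relies on.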

Let $X^j(n)$ be the state of the variant $j$ at time $\tau_n$
($j = 1, \ldots, M$). Then:

Algorithm B: updating the states of the systems at
$\tau_{n+1}$

(Assume that $e_{n+1} = e_i$)

1. generate $w_{n+1}$, a sample of a uniform(0,1) r.v.
   $W_{n+1}$.

2. For $j = 1, \ldots, M$
   set $X^j(n+1) = f^j(X^j(n), e_i, W_{n+1})$
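The interplay of the two algorithms can be sketched serially in Python. Everything named here is illustrative: `standard_clock_run` and `make_queue` are hypothetical helpers, `random.choices` stands in for the alias draw of Algorithm A, and the variants are single-server queues that differ only in capacity rather than the tandem lines of the examples.

```python
import random


def standard_clock_run(transitions, initial_states, event_probs,
                       total_rate, n_events, rng):
    """Drive all variants with one shared stream of (time, event, W) samples."""
    states = list(initial_states)
    events = range(len(event_probs))
    tau = 0.0
    for _ in range(n_events):
        tau += rng.expovariate(1.0) / total_rate            # Algorithm A, step 1
        e = rng.choices(events, weights=event_probs)[0]     # stand-in for steps 2-3
        w = rng.random()                                    # shared sample of W
        # Algorithm B: the same event is applied to every variant.
        states = [f(x, e, w) for f, x in zip(transitions, states)]
    return tau, states


def make_queue(capacity):
    """Transition rule of a queue: event 0 = arrival, event 1 = departure."""
    def f(x, e, w):
        return min(x + 1, capacity) if e == 0 else max(x - 1, 0)
    return f


caps = [1, 3, 5]
tau, states = standard_clock_run([make_queue(c) for c in caps],
                                 [0, 0, 0], [0.6, 0.4], 1.0, 500,
                                 random.Random(42))
```

Because every variant sees the same events, the queue lengths stay ordered by capacity along the whole run, a small instance of the sample-path coupling the approach relies on.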

Massively  parallel  implementation 

The  Connection  Machine  (CM)  that  we  have  used  as 
the  platform  for  the  massively  parallel  implementation 
of  the  SC  algorithm  is  a  SIMD  (Single  Instruction  Mul¬ 
tiple  Data)  computer.  It  consists  of  a  large  number  of 
small  processors  (32000  in  our  case)  each  with  its  as¬ 
sociated  memory.  All  the  processors  operate  under  the 
direction  of  a  serial  computer,  called  the  front  end.  The 
front  end  acts  as  a  central  control  mechanism  that  directs 
all  processors  as  to  the  next  instruction  to  be  executed. 
All processors then execute the same instruction; hence
the name Single Instruction Multiple Data (SIMD) systems
(for a more extensive description of parallel and
distributed systems see [1], [4]). In massively parallel
systems the synchronization of the computational tasks is a
crucial  element  of  the  parallel  implementation.  The  SC 
algorithm  is  particularly  well  suited  for  such  implemen¬ 
tation. 


We  simulate  algorithm  A  (i.e.  the  clock  mechanism) 
at  the  front  end  computer:  at  each  tick  of  this  clock  the 
time  and  type  of  the  “next”  event  is  generated.  Algo¬ 
rithm  B  is  implemented  in  a  distributed  fashion  at  the 
CM:  each  processor  of  the  CM  simulates  a  version  of  the 
system with a distinct parameter value. The event type
and time generated by the clock are broadcast to all
processors, which in turn execute the instruction
corresponding to the event type. This execution is done according
to  the  parameter  value  at  the  processor.  To  illustrate 
we  consider  the  parallel  implementation  of  the  model  of 
example  3.1  at  a  finite  number  (M)  of  parameter  values: 

Example 4.1: The implementation of the clock mechanism
(Algorithm A) at the front end is trivial. To
implement Algorithm B we define parallel variables
$x_1, \ldots, x_{K-1}$ to represent the states of the systems at
all variants: $x_1$ is an $M$ dimensional parallel variable
whose every component is kept at a distinct processor. The
value kept at processor $j$ is the number of customers at
buffer 1 at the configuration associated with processor $j$.
Similarly we define parallel variables $B_1, \ldots, B_{K-1}$ (the
$j$th component of $B_i$, kept at processor $j$, is the number
of buffers between the $i$th and $(i+1)$th servers at
configuration $j$). Assume that the event reported by the front
end (the clock mechanism) is $d_i$ (for simplicity assume
$1 < i < K-1$). To update the states of the systems we
proceed as follows:

Define a logical parallel variable $A$ as:

   $A = 1$ if $x_i > 0$, $x_{i+1} < B_{i+1}$
   $A = 0$ otherwise

and execute the following code (the code is executed in
parallel on CM)

   set A = (x_i > 0 and x_{i+1} < B_{i+1})
   x_i = x_i - A
   x_{i+1} = x_{i+1} + A

The component of $A$ at processor $j$ takes value 1 if $x_i > 0$
and $x_{i+1} < B_{i+1}$ are both satisfied; otherwise it takes
value 0 (these conditions are checked in parallel at each
processor based on local information at the processor).
The next two steps represent the movement of a part
for all processors where $A = 1$ and no action for those
where $A = 0$.
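The three parallel statements can be mimicked in plain Python, with one list entry per processor. This is a serial emulation, not Connection Machine code: the name `simd_departure` and the dictionary-of-lists layout are our assumptions, and the condition is copied as printed in the example.

```python
def simd_departure(x, B, i):
    """Apply event d_i to every variant at once (1-based i, interior case).

    x[i][j], B[i][j]: content and capacity of buffer i in variant j.
    """
    M = len(x[i])
    # Logical parallel variable: 1 where the move is allowed, 0 elsewhere.
    A = [int(x[i][j] > 0 and x[i + 1][j] < B[i + 1][j]) for j in range(M)]
    x[i] = [x[i][j] - A[j] for j in range(M)]          # x_i = x_i - A
    x[i + 1] = [x[i + 1][j] + A[j] for j in range(M)]  # x_{i+1} = x_{i+1} + A
    return A


# Three variants, two buffers.
x = {1: [1, 0, 2], 2: [0, 0, 1]}
B = {1: [2, 2, 2], 2: [1, 1, 1]}
mask = simd_departure(x, B, 1)
```

Using the 0/1 mask arithmetically, rather than branching, is what lets every processor execute the same instruction stream whether or not the event takes effect in its variant.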

5  Performance  optimisation 

Such  massively  parallel  implementations  dramatically 
increase  our  ability  to  generate  data  points  (performance 
estimates) for analysing and optimizing queueing network
performance. An immediate and important question
to be answered is: what type of optimization algorithms
are most appropriate in this context?

As a first step we have considered a random global
search  approach  on  the  parameter  space.  Due  to  a  lack  of 
space  and  to  the  preliminary  nature  of  our  investigation 
our  discussion  below  will  be  informal. 

Consider  the  following  optimization  problem: 

$\max \{J(\theta);\ \theta \in \Theta\}$  (3)

where $J(\theta) = \lim_{T \to \infty} g(X(\theta, \omega, T))$
($\omega \in \Omega$ represents all the underlying randomness in
the system). $J(\theta)$ is, for example, some average steady
state performance of the network at parameter value $\theta$.
To address this problem we proceed as follows:

Let $\theta_1, \ldots, \theta_M$ be $M$ parameter values in $\Theta$
chosen randomly according to some distribution on $\Theta$. We
run $X(\theta_1, \omega, \cdot), \ldots, X(\theta_M, \omega, \cdot)$ in
parallel and evaluate
$g(X(\theta_1, \omega, T_k)), \ldots, g(X(\theta_M, \omega, T_k))$
at different epochs $T_k$ ($k = 1, 2, \ldots$). At each epoch the
parameter values are ranked in descending order of
$g(X(\theta_j, \omega, T_k))$. We
choose the best $L$ parameter values at each $T_k$ (those
with  highest  value  of  g);  when  this  population  “stabi¬ 
lizes”  (i.e.  when  there  is  a  small  migration  in  and  out  of 
the  population,  or  changes  in  ranking  within  the  popu¬ 
lation)  the  simulation  is  stopped. 
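The ranking and stopping rule can be sketched as follows. The text leaves “stabilizes” informal, so the migration threshold used here, like the function names `top_l` and `has_stabilized`, is an assumption made for illustration only.

```python
def top_l(scores, L):
    """Indices of the L best variants, best first (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return order[:L]


def has_stabilized(prev_top, curr_top, max_migration=0):
    """True if at most max_migration variants entered the top-L set."""
    return len(set(curr_top) - set(prev_top)) <= max_migration


# Sample performances of four variants at two successive epochs.
epoch1 = top_l([2.0, 5.0, 5.0, 1.0], L=2)
epoch2 = top_l([2.5, 5.5, 5.1, 1.0], L=2)
stop = has_stabilized(epoch1, epoch2)
```

A stricter rule could also compare the order within the top set; the version above only tracks migration in and out of the population.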

The following considerations have been the basis of our
approach: our objective is to find near optimal solutions
to (3). For “large” values of $M$, and “reasonable”
performance functions $J(\theta)$, the top $L$ parameter values
at the termination of the simulation are expected to
be “near-optimal” with “high probability”, i.e. produce
performance measures that are close to $\max J(\theta)$.
Furthermore, in the context of networks of queues they are
expected to reveal some of the “desirable” properties of
near-optimal variants. A concurrent comparison of sample
performances of all variants is possible because in
the SC simulation, all variants live in the same simulated
world. This approach is identical to some of the
coupling methods of sample paths of stochastic processes,
which define them on the same probability space to
establish stochastic monotonicity (e.g. see [5]).

Example 5.1: Consider the system of example 3.1
with the following modification: there are 11 servers in
the system and server $i$ is Erlang($r_i$, $\mu$). Consider the
problem of optimal allocation of 20 buffers between the
servers in order to maximize throughput. In our example
$\mu = 1$ and $(r_1, \ldots, r_{11}) = (1, 2, 5, 4, 4, 2, 2, 3, 2, 5, 2)$.
4000 variants of the system (numbered 1 through 4000)
were randomly selected and simulated in parallel on CM.
At $T_1 = \tau_{5000}$, $T_2 = \tau_{10000}$, and $T_3 = \tau_{15000}$ the
parameters were ranked (the simulation was performed in about
20 sec). By $T_3$, the top 20 ranked variants had “stabilized”
(by this time in the “best” configuration 243 parts
were produced). The ranked configurations were also observed
at $T_4 = \tau_{100000}$ and $T_5 = \tau_{150000}$ to check for possible
long term change in ranking. The Table below shows the
rank of 5 top configurations at $T_1$, $T_3$, $T_4$, $T_5$.


config.   R at T1   R at T2   R at T3   R at T4
570          1         1         1         1
3321        19         4         2         2
907          6         3         3         3
1756        15         6         4         4
1277         3         2         5         5

They correspond to the following allocation of buffers
(we have also included the number of parts produced at
these configurations by $T_5$):

$b_{570} = (1, 2, 3, 3, 2, 1, 2, 2, 2, 2)$, Parts produced = 2449.
$b_{3321} = (1, 2, 3, 2, 2, 1, 2, 2, 3, 2)$, Parts produced = 2397.
$b_{907} = (1, 3, 3, 2, 2, 1, 2, 2, 2, 2)$, Parts produced = 2385.
$b_{1756} = (2, 2, 3, 2, 2, 1, 2, 2, 2, 2)$, Parts produced = 2378.
$b_{1277} = (1, 2, 3, 2, 2, 3, 1, 2, 2, 2)$, Parts produced = 2362.


Abstract

This  paper  describes  Version  3  of  a  GPSS 
compiler.  GPSS  is  a  discrete  event  simulation 
language  used  to  model  queuing  problems.  The 
compiler  was  written  in  the  SAS  language  (version 
6.06),  which  was  chosen  for  three  reasons:  (1)  it  has 
character  string  handling  and  other  functions 
required  for  a  compiler,  (2)  the  SAS  language  has  a 
full  range  of  mathematical  and  statistical  functions 
that  are  used  to  extend  the  GPSS  syntax  and  (3)  the 
statistical  procedures  in  the  SAS  system  are  available 
to  preprocess  data  for  the  simulation  or  to 
postprocess  simulation  output. 

The  current  version  of  the  compiler  implements 
much  of  the  GPSS  functionality  and  contains  the 
usual  devices  in  a  simulation  language  including  a 
clock  mechanism,  an  event  scheduler,  a  source  of 
random  numbers  following  a  large  number  of 
probability  distributions  and  data  structures  to 
represent  queues  and  other  required  quantities. 


I.  Introduction 

A.  Simulation  Language 

A  simulation  language  is  a  computer  language 
which  facilitates  the  programming  of  models  for 
discrete-event  simulations.  It  is  useful  for  solving 
queueing  problems  because  it  has  constructs  which 
represent  all  the  aspects  of  the  queuing  situation.  It 
is  possible  but  tedious  to  program  a  simulation 
problem  in  a  high  level  language  such  as  Fortran.  A 
simulation  language  automatically  handles  many 
tasks  such  as  maintaining  a  simulated  clock, 
scheduling  events  and  causing  them  to  occur  in  the 
proper  time-ordered  sequence.  In  addition,  most 
simulation  languages  automatically  collect  data 
describing  the  model’s  simulated  behavior  and  print 
out  summaries  of  these  data.  Thus  much  of  the 
underlying  logic  of  the  simulation  of  the  queuing 
problem  is  built  into  the  simulation  language. 

We also describe in the paper how the compiler
performs typical functions such as storage allocation,
symbol table maintenance, cross referencing, garbage
collection and error messaging. Applications for this
compiler and some thoughts on using the SAS
language as the development language are also discussed.

A GPSS program consists of a sequence of
statements, called blocks, which correspond to the
boxes in the flow diagram of a queueing model. The
GPSS/SAS compiler first translates a GPSS program into
a SAS program.
Entities  called  transactions  (for  example, 
representing  customers)  move  through  these  blocks. 
At  any  simulated  instant,  there  may  be  many 
transactions  in  different  parts  of  the  flow  diagram. 
Transactions  can  model  the  movement  of  customers 
through  a  facility.  Usually  a  transaction  is  on  one  of 
two  lists,  the  current  events  chain  (CEC),  or  the 
future  events  chain  (FEC).  Transactions  on  the  CEC 
are  moving  or  ready  to  move  through  the  blocks  in 
the  program.  They  can  be  held  up  if  a  block  refuses 
entry  or  can  be  delayed.  Transactions  on  the  FEC 
will  move  later  when  the  simulated  clock  reaches 
their  block  departure  time,  at  which  time  they  will 
be transferred to the CEC to continue progress. At
any instant of simulated time, GPSS tries to move
each current (CEC) transaction as far as possible
through  the  block  diagram.  Every  transaction  has  a 
priority,  which  can  be  changed  as  it  goes  through  the 
program.  The  CEC  is  in  order  of  highest  to  lowest 
priority,  causing  transactions  of  high  priority  to 
move  before  those  of  lower  priority  at  any  given 
simulated  time.  Transactions  also  have  parameters 
which  may  be  used  to  carry  data. 

B.  Why  SAS? 

SAS was chosen as the language in which to write
the compiler for the following reasons: (1) the
completeness and flexibility of SAS as a
programming language, (2) the capability for outputs
to be analyzed through immediate access to SAS’s
high quality statistics and graphics procedures, and
(3) SAS has good random number generators, built-in
mathematical functions and character string-handling
functions useful in parsing program code.
The disadvantages of using SAS are that there are no
multidimensional arrays and the execution speed is
relatively slow.

C.  Background  on  Previous  Versions 

The  original  version  of  the  GPSS/SAS  compiler  is 
described  in  ’A  GPSS-like  Language  in  SAS  for 
Discrete  Event  Simulation’  (Proceedings  of  SUGI, 
1988).  At  that  time,  the  program  consisted  of  a 
single  SAS  data  step.  The  GPSS  language  statements 
which were implemented were GENERATE,
ASSIGN, TRANSFER, ENTER, ADVANCE,
LEAVE, and TERMINATE.

The  second  version  of  the  program  is  described 
in  ’How  to  Stop  a  Simulation’  (Proceedings  of  SUGI, 
1990).  At  this  point,  the  compiler  was  separated  into 
five  data  steps,  in  order  for  the  work  to  be 
modularized  (see  figure  1).  The  main  data  steps 
were  LEXAN,  PASS2  and  RUNSTEP.  These  data 
steps call TABLES (which contains the symbol
table) and ERRMSGS (which contains the error
messages). LEXAN is a lexical analyzer whose main
job is to translate free format mixed case input to
fixed format uppercase, and also to compress spaces.
The  output  from  LEXAN  is  passed  to  PASS2,  where 
most  of  the  compiling  and  code  generating  takes 
place.  Output  from  PASS2  goes  into  RUNSTEP, 
where the simulation execution occurs. The
operation of the simulated clock, the scheduling of
events, and the movement of transactions from block
to block are all part of RUNSTEP.
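The normalization attributed to LEXAN can be illustrated with a short Python function. This is not the authors' SAS code: the name `normalize_statement` is hypothetical, and the sketch covers only case folding and space compression, not the placement of fields into fixed columns.

```python
def normalize_statement(line):
    """Upper-case a free-format statement and collapse runs of whitespace."""
    return " ".join(line.upper().split())


cleaned = normalize_statement("  xerox    stor   2 ")
```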

In  addition  to  the  features  implemented  in 
Version  1,  two  new  statements  were  implemented  in 
Version 2: REGS (regenerative start) and REGE
(regenerative end), blocks which cause counting of
the number of transactions and waiting times.
These  features  were  meant  to  be  used  to  collect 
queue  statistics.  This  then  permits  stopping  the 
simulation  after  a  completion  of  enough  events  to 
allow  interval  estimation  of  parameters  with 
appropriate  precision. 

We  have  now  completed  Version  3  of  the  compiler. 
Version  3  has  the  same  structure  as  Version  2,  but 
implements  a  much  larger  subset  of  GPSS  including 
MATRIX  handling,  parameters,  GATES  and 
LOGIC,  TEST,  etc. 


II.  The structure of our Compiler.

Version  3  consists  of  three  main  data  steps 
working  with  5  files.  Figure  1  below  shows  the  way 
they  work  together. 

Figure 1

Compiler Phases


SIMDATA ---> LEXAN <--- TABLES
               |
 SELECT <--- PASS2 ---> INITIALS
    |          |           |
    +------> RUNSTEP <-----+


A.  LEXAN,  a  lexical  analyzer.  Changes  the  free 
format  of  the  program,  SIMDATA,  to  fixed  format, 
lowercase  to  upper  case,  does  space  compression, 
puts  entities  into  labels,  and  reads  in  symbol 
TABLES.  It  passes  the  analyzed  GPSS  program  to 
PASS2. 

B.  PASS2  does  entity  translation,  symbol  table 
maintenance,  storage  allocation,  macro  variable 
creation  for  array  dimensioning  going  into  the 
INITIALS  data  set,  syntax  analysis,  compile  time 
error  messaging,  translation  of  GPSS  random  number 
calls  to  SAS  random  number  subroutine  calls,  and 
creation  of  the  dynamic  half  of  RUNSTEP,  the  file 
SELECT.  It  passes  the  compiled  program  to 
RUNSTEP. 

C.  RUNSTEP  does  the  actual  execution  of  the 
compiled  code  from  SELECT  and  INITIALS. 
Dynamic  storage  allocation  is  done  from  INITIALS. 
Also  done  are  garbage  collection  from  parameter 
arrays,  run-time  error  messaging,  simulation  event 
tracing  and  output  in  the  form  of  REPORT. 

III.  Examples  and  sample  output  from  the  compiler. 

The  text  below  describes  how  GPSS  operands  are 
translated  into  SAS  statements  by  the  compiler.  The 
usual  form  of  a  GPSS  language  statement  is: 

LAB  OP-FLD  AUX  OPRNDS 
where LAB refers to LABEL, OP-FLD refers to
OPERATION FIELD, AUX refers to AUXILIARY,
and OPRNDS refers to OPERANDS. LABELS
identify either the statement or the entity such as a
STORAGE or a MATRIX. In Figure 2
below, the first line has the label "MIKE" which is
the name of the matrix to be dimensioned. Labels
are sometimes optional. OPERATION FIELDS
define the purpose of the GPSS statement. Line 1
has the operation "MATR(IX)" which causes
dimensioning. Line 5 has operation "GENE" or
"GENERATE" which causes production of a
transaction. Auxiliaries are adjuncts to operations
which further define the operation. Line 8 has an
auxiliary, L, to TEST on LESS THAN. Operands (up
to 8) are found to the right of operations (or
auxiliaries if present).

Figure 2

MIKE  MATR  3  2 

INIT  M$MIKE(3,2)  1 
INIT  X$JON  2
XEROX  STOR  2 

LABEL  1  GENE  RANUNI(X$JON)  2  15  .  .  21 
MSAV  MIKE  3  2  M$MIKE(3,2)+1 
ASSI  1  M$MIKE(3,2) 

TEST  L  M$MIKE(3,2)  12  SKIP 
ENTE  XEROX 
ADVA  5*P1
LEAV  XEROX 
SKIP  TERM  1 
STAR  4 
END 


Figure  2  represents  nearly  original  GPSS  source 
code  which  has  been  processed  by  LEXAN. 

The  program  is  changed  by  PASS2.  PASS2 
performs  GPSS  entity  translation,  symbol  table 
maintenance,  storage  allocation,  macro  variable 
creation  for  array  dimensioning,  syntax  analysis, 
compile-time  error  messaging,  translation  of  GPSS 
random  number  calls  to  SAS  random  number 
subroutine  calls,  and  creation  of  the  dynamic 
portions  of  RUNSTEP.  In  that  part  of  the  program, 
GPSS  labels  are  assigned  to  SAS  variables  and 
initialized,  storage  for  various  GPSS  entities  is 
created,  GPSS  operands  are  translated  to  SAS 
expressions  and  pointers  of  various  types  are  set  up. 
Then RUNSTEP (Section II.C above) needs only
two pieces of information: (1) the type of operation
field being executed, and (2) the values of the
operands at the time the statement is being executed.
The operation field code is passed through the
PASS2 SAS dataset, while the operand values are
obtained in one of the dynamic portions, the
SELECT file.

The SELECT file evaluates the operands and the
final values are set to be _T1, _T2, etc., up to _T8.
Figure 3 shows the SELECT file for statement
number 5 (LABEL 1 GENE ...). First a temporary
value is set to the first savevalue, X{001}. Then a
call to RANUNI is made with X$JON as the seed,
the random number being put in _T101. Then
X$JON is set back to the new seed, and _T1 is set to
_T101, the first operand. The third operand is set
to 15, and the sixth to 2. What is occurring is that
PASS2 is translating GPSS code into SAS code which
then gets appended to the end of the RUNSTEP data
step. In this manner any valid SAS expression can be
used as a GPSS operand, representing a substantial
extension to the language.


Figure 3

WHEN (005) DO;
   TEMP011 = X{001};
   CALL RANUNI(TEMP011,_T101);
   X{001} = TEMP011;
   _T1 = _T101;
   _T2 = .;
   _T3 = 15;
   _T4 = .;
   _T5 = .;
   _T6 = 2;
END;

The main data structures in RUNSTEP are those
associated with transactions, TA{*}, NEXT{*} and
_PARMTX{*} (the parameter array), and the block
arrays (BLKTYPE{*}, BLKAUX{*}, BLKCNT{*}
and BLKMISC{*}). The transaction array is
dimensioned beforehand to the best guess at the
maximum number of active transactions * 10. The
block arrays are dynamically dimensioned to the
block counts in INITIALS (see figure 4). The
block arrays contain information which is specific to
the blocks. Filling the block arrays is the last step in
compilation and is done in the beginning of
RUNSTEP. BLKTYPE{*} gets a number
representing the operation field of the block.
BLKAUX{*} gets the auxiliary operand.
BLKMISC{*} is used for miscellaneous operations on
blocks such as the value of the logical evaluation of
the test block, etc. BLKCNT{*} is the block count
or the number of transactions which have passed
through the block.

TA{*} is a linked list with the pointers in
NEXT{*}. TA{*} contains most of the relevant
information about the transaction, including its block
departure time (BDT), number, current block
occupied by the transaction, transaction status
(active, blocked or terminated), maximum number of
parameters, pointer to starting place in the parameter
matrix (_PARMTX{*}) and priority. BDT
represents the time that the transaction may be
moved from its current block. Transactions are
linked by BDT and priority, that is NEXT{i} points
to the transaction with the same BDT (and lower
priority) or the next larger BDT. This allows scanning the
transaction array from the beginning in order to find
the next transaction to be moved. Transactions are
inserted in the TA{*} array when created in the
GENERATE block and removed when they move
through the TERMINATE block. The position of
the transaction may be modified by traversing
through an ADVANCE (which causes revision of the
BDT) or a PRIORITY block (changing the priority).
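The ordering described above can be illustrated with an array-backed linked list in Python. This is a sketch of the idea rather than the compiler's SAS arrays: the class `TransactionList`, its fields, and the first-come tie-breaking among transactions with equal BDT and priority are our assumptions.

```python
class TransactionList:
    """Singly linked list in parallel arrays, ordered by BDT, then by
    descending priority."""

    def __init__(self):
        self.bdt = []        # block departure time of each slot
        self.priority = []   # priority of each slot
        self.next = []       # successor slot, or -1 at the tail
        self.head = -1

    def _before(self, a, b):
        """True if slot a must be moved before slot b."""
        return (self.bdt[a], -self.priority[a]) < (self.bdt[b], -self.priority[b])

    def insert(self, bdt, priority):
        slot = len(self.bdt)
        self.bdt.append(bdt)
        self.priority.append(priority)
        self.next.append(-1)
        prev, cur = -1, self.head
        while cur != -1 and not self._before(slot, cur):
            prev, cur = cur, self.next[cur]      # walk past earlier transactions
        self.next[slot] = cur
        if prev == -1:
            self.head = slot
        else:
            self.next[prev] = slot
        return slot

    def pop_first(self):
        """Unlink and return the next transaction to move (list must be non-empty)."""
        slot = self.head
        self.head = self.next[slot]
        return slot

    def order(self):
        out, cur = [], self.head
        while cur != -1:
            out.append(cur)
            cur = self.next[cur]
        return out


chain = TransactionList()
for bdt, prio in [(5.0, 0), (2.0, 0), (5.0, 3), (2.0, 0)]:
    chain.insert(bdt, prio)
```

Because the list is kept sorted on insertion, finding the next transaction to move is a constant-time read of the head.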

Before  beginning  the  simulation,  at  the  point 
where  each  GENERATE  block  is  symbolized,  its 
first  transaction  is  made,  space  allocated  in  the 
_PARMTX{*}  array,  its  block  departure  time 
computed,  and  it  is  installed  in  the  TA{*}  linked  list. 
The  first  transaction  is  then  taken  off  the  top  of  the 
linked  list  and  the  simulation  clock  is  set  to  its  BDT. 
Then  the  following  pattern  ensues: 

1.  The  transaction  is  moved  as  far  as  it  can  be 
moved.  It  is  then  destroyed  or  put  back  into  the 
linked  list. 

2.  The  next  transaction  is  identified. 

3.  The  simulation  clock  is  updated  if  required  to 
the  block  departure  time  for  the  next 
transaction. 

4.  If the termination counter is zero, the simulation
stops; otherwise return to step 1.
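The four-step pattern above can be sketched as a small event loop in Python. This is a deliberate simplification and not the RUNSTEP code: a heap replaces the linked list, each transaction passes through a single ADVANCE-like delay before terminating, and the names `run`, `arrivals` and `service_time` are hypothetical.

```python
import heapq


def run(arrivals, service_time, terminations_to_stop):
    """Simulate transactions that are delayed once and then terminate.

    arrivals: list of (arrival time, priority) pairs, one per transaction.
    Returns the final clock value and a log of (time, transaction id).
    """
    # Heap entries: (block departure time, -priority, id, block index).
    future = [(t, -prio, tid, 0) for tid, (t, prio) in enumerate(arrivals)]
    heapq.heapify(future)
    clock, remaining, log = 0.0, terminations_to_stop, []
    while future and remaining > 0:
        bdt, neg_prio, tid, block = heapq.heappop(future)  # step 2: next transaction
        clock = bdt                                        # step 3: update the clock
        if block == 0:                                     # step 1: move it onward
            heapq.heappush(future, (clock + service_time, neg_prio, tid, 1))
        else:
            log.append((clock, tid))                       # transaction terminates
            remaining -= 1                                 # step 4: termination counter
    return clock, log


final_clock, log = run([(0.0, 0), (1.0, 5), (1.0, 0)], 2.0, 2)
```

Storing the negated priority in the heap key makes higher-priority transactions leave first among those sharing a departure time.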

Figure  4 

INITIALS 

ARRAY M {*} _M1 - _M7;
ARRAY OFF {*} _OFF1 - _OFF2;
ARRAY NC {*} _NC1 - _NC2;
ARRAY X {*} _X1 - _X2;
ARRAY BLOKTYPE {*} _BT1 - _BT15;
ARRAY BLOKAUX {*} $2 _BX1 - _BX15;
ARRAY BLOKCTS {*} _BCTS1 - _BCTS15;
ARRAY BLOKCNT {*} _BCNT1 - _BCNT15;
ARRAY BLOKMISC {*} _BMIS1 - _BMIS15;
ARRAY STORCAP {*} _STC1 - _STC2;
ARRAY STORUSE {*} _STU1 - _STU2;

RETAIN
   _STC1 2
   _STU1 0
   _NC1 2
   _OFF1 1

NULL  0000 
LABEL  1  0006 
SKIP  0013 


Conclusion:

We think the third version represents a
substantial extension over other versions. We plan to
test it using practical applications in the near
future.

Abstract

Confidence intervals obtained by bootstrap methods and
by the normal approximation are compared, based on output
data from terminating and steady-state simulations.
Bootstrap intervals are equal to or better than normal
approximation intervals in actual coverage probability.
Furthermore, bootstrap methods capture the skewness
in the distribution of the outputs and are therefore more
desirable than the normal approximation.

1  Introduction 

Computer simulation is a method for studying a system
or process that is too complex to yield analytic results
for the performance measure of interest. Usually
several simulation runs are conducted and the resulting
output data are used to make inferences about the
performance measure; for instance, the average delay in
a queueing system. Here we assume that proper steps have
been taken so that the outputs from either a terminating
or a steady-state simulation are independent and identically
distributed. Law [4] gives precise definitions of the
two types of simulation. In this article the regenerative
method is considered for the case of steady-state simulation.
The central limit theorem (normal approximation) is
the most common technique for constructing a confidence
interval for a performance measure, because it is
easy to use and, when the number of replications is large, it
yields very accurate results. However, it does not capture
the asymmetric nature of the underlying distribution
of the output data. Since the distribution of the data is
rarely known, we are dealing with a nonparametric situation,
where the bootstrap method [3] proves useful:
it takes the asymmetry into account and is
as easy to implement as the normal approximation. A brief
description of bootstrap methods and the related confidence
intervals for a mean is given in Section 2. Section 3
contains confidence intervals obtained by the two methods
for an M/M/1 queue and a reliability model, both of which
pertain to terminating simulation. The M/M/1 queue is
studied again using the regenerative method in Section 4,
where jackknife and bootstrap confidence intervals for the
steady-state average delay in the system are compared.
Section 5 includes some conclusions.

2  Bootstrap  Methods 

The bootstrap method is a resampling scheme. It uses a
given set of independent, identically distributed observations
$Y = (X_1, \ldots, X_n)$ from an unknown distribution
$F$ to construct an empirical distribution $\hat{F}$. Random
samples $Y^*_1, \ldots, Y^*_B$ are then taken from $\hat{F}$. This is the
same as sampling from $\{X_1, \ldots, X_n\}$ with replacement.
Suppose $\theta$ is the parameter of interest and $\hat{\theta}$ is an
estimate of $\theta$. The bootstrap estimates $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$ can be
calculated from $Y^*_1, \ldots, Y^*_B$; they are used to assess the
accuracy of $\hat{\theta}$ or to form the bootstrap distribution $G$,
defined by $G(s) = \Pr[\hat{\theta}^* < s]$. Using the $\alpha$th and $(1-\alpha)$th
percentiles of $G$ as the endpoints of the interval yields a
$(1-2\alpha)100\%$ confidence interval for $\theta$. This is the simplest of
the bootstrap methods for constructing confidence intervals
and is called the percentile method (P).
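
As an illustration of the percentile method just described, the following minimal Python sketch resamples the data with replacement and reads the interval endpoints off the sorted bootstrap means. The function name, the default `n_boot = 1000` and the choice of the sample mean as the estimator are assumptions made for the example, not values taken from the paper.

```python
import random

def percentile_interval(data, alpha=0.05, n_boot=1000, seed=0):
    """(1 - 2*alpha) percentile-method interval for the mean of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Bootstrap estimates: the mean of each resample drawn with replacement.
    estimates = sorted(
        sum(rng.choices(data, k=n)) / n for _ in range(n_boot)
    )
    # alpha-th and (1 - alpha)-th percentiles of the bootstrap distribution G.
    lower = estimates[round(alpha * n_boot)]
    upper = estimates[round((1 - alpha) * n_boot) - 1]
    return lower, upper
```

Because the endpoints are percentiles of the resampled means, the interval need not be symmetric about the sample mean.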

Improvements on the percentile method have been
proposed, notably the bias-corrected percentile
method (BC) and the bias-corrected percentile acceleration
method (BCa). The Edgeworth expansion technique can be
employed to get asymptotic expressions for the endpoints
of the BCa, BC and percentile intervals for the case of
estimating the population mean, $\mu$ [2].

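$$\theta_{(BCa)}[\alpha] = \bar{x} + \frac{s}{\sqrt{n}}\left\{ t(\alpha) + \hat{a}\left[2t^2(\alpha) + 1\right] \right\} \quad (1)$$

$$\theta_{(BC)}[\alpha] = \bar{x} + \frac{s}{\sqrt{n}}\left\{ t(\alpha) + \hat{a}\left[t^2(\alpha) + 1\right] \right\} \quad (2)$$

$$\theta_{(P)}[\alpha] = \bar{x} + \frac{s}{\sqrt{n}}\left\{ t(\alpha) + \hat{a}\left[t^2(\alpha) - 1\right] \right\} \quad (3)$$

where $\bar{x}$ and $s$ are the mean and standard deviation of
the data, $t(\alpha)$ is the $\alpha$th percentile of the $t$ distribution with
$(n-1)$ degrees of freedom and

$$\hat{a} = \frac{1}{6}\,\frac{\sum_{i}(x_i - \bar{x})^3}{\left[\sum_{i}(x_i - \bar{x})^2\right]^{3/2}}.$$

Thus, $(\theta_{(P)}[\alpha],\ \theta_{(P)}[1-\alpha])$ gives a $(1-2\alpha)100\%$ confidence
interval for $\mu$ by the percentile method.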

3 Confidence Intervals for Terminating Simulation

In order to compare bootstrap methods with the normal
approximation, we study the following systems.

• Model 1 - M/M/1 queueing system with utilization
factor $\rho = 0.9$ ([5], p.289). Assuming that the
number of customers in the queue at time 0 is zero,
the performance measure of interest is the expected
average delay in the queue for the first 25 customers
entering the system, which is 2.124.

• Model 2 - Reliability model ([5], p.289) consisting
of three components, each of which has a lifetime
following a Weibull distribution with shape parameter
0.5 and scale parameter 1.0. The model is structured
in such a way that the system functions as
long as component 1 works and either component 2
or 3 works. The performance measure of interest is
the mean lifetime of the system, which can be shown
to be 0.778.
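
The value 0.778 quoted for Model 2 can be checked. With shape 0.5 and scale 1, each component has survival function $S(t) = e^{-\sqrt{t}}$; the system survives when component 1 and at least one of components 2 and 3 survive, so its survival function is $S(1-(1-S)^2) = 2S^2 - S^3$, and since $\int_0^\infty e^{-c\sqrt{t}}\,dt = 2/c^2$ the mean lifetime is $2(2/4) - 2/9 = 7/9 \approx 0.778$. The sketch below evaluates this and compares it with a small Monte Carlo run; the function names, run count and seed are illustrative assumptions.

```python
import random

def exact_mean_lifetime():
    # Integral of 2*exp(-2*sqrt(t)) - exp(-3*sqrt(t)) over t >= 0,
    # using the identity: integral of exp(-c*sqrt(t)) dt = 2 / c**2.
    return 2 * (2 / 2**2) - 2 / 3**2

def simulate_mean_lifetime(n_runs=100_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
        t1, t2, t3 = (rng.weibullvariate(1.0, 0.5) for _ in range(3))
        # System works while component 1 works and component 2 or 3 works.
        total += min(t1, max(t2, t3))
    return total / n_runs
```

The lifetime distribution is heavily skewed, so individual small-sample means scatter widely around 7/9; this is the setting in which the coverage comparison is made.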

The $(1-2\alpha)100\%$ confidence interval for the measure
of each system, based on the central limit theorem, is

$$\bar{x} \pm t(1-\alpha)\,\frac{s}{\sqrt{n}} \quad (4)$$

and the corresponding bootstrap confidence intervals are
given by equations (1), (2) and (3).

500 simulation runs are conducted for each model and,
for each run, replication sizes n = 5, 10, 20 and 40 are
considered. The nominal confidence level is 90%. The actual
coverage probabilities, along with 90% confidence intervals
for the true coverages, are summarized in Tables 1 and 2.


Table 1 - Estimated Coverage Results for Model 1
(Terminating M/M/1 Queue)

Sample Size        5             10            20            40
Normal App.        .844 ± .027   .868 ± .025   .880 ± .024   .882 ± .024
Bootstrap (P)      .842 ± .027   .874 ± .025   .880 ± .024   .886 ± .023
Bootstrap (BC)     .838 ± .027   .880 ± .024   .876 ± .024   .894 ± .023
Bootstrap (BCa)    .840 ± .027   .882 ± .024   .880 ± .024   .894 ± .023

Table 2 - Estimated Coverage Results for Model 2
(Weibull Model)

Sample Size        5             10            20            40
Normal App.        .700 ± .034   .758 ± .032   .816 ± .029   .840 ± .027
Bootstrap (P)      .710 ± .033   .7?2 ± .031   .820 ± .028   .838 ± .027
Bootstrap (BC)     .738 ± .032   .790 ± .030   .836 ± .027   .842 ± .027
Bootstrap (BCa)    .740 ± .032   .790 ± .030   .780 ± .030   .842 ± .028

The distributions involved are quite skewed, as indicated
by the sample skewness, which is 1.755 and 5.35
for Models 1 and 2, respectively. However, equation (4)
always provides a symmetric interval, which is of course
unrealistic. The asymmetry of a confidence interval for a
mean can be described by the asymmetry coefficient,
defined by $(UB - \bar{x})/(\bar{x} - LB)$, where $UB$ and $LB$ are the upper and lower
confidence bounds respectively. Tables 3 and 4 contain
the values of the coefficient for each model. It is apparent
that all bootstrap intervals capture this asymmetry.

Table 3 - Asymmetry Results for Model 1 (Terminating
M/M/1 Queue)

Sample Size        5        10       20       40
Normal App.        1.000    1.000    1.000    1.000
Bootstrap (P)      1.046    1.061    1.056    1.049
Bootstrap (BC)     1.282    1.299    1.243    1.201
Bootstrap (BCa)    1.580    1.598    1.468    1.378

Table 4 - Asymmetry Results for Model 2 (Weibull
Model)

Sample Size        5        10       20            40
Normal App.        1.000    1.000    1.000         1.000
Bootstrap (P)      1.075    1.109    [illegible]   [illegible]
Bootstrap (BC)     1.504    1.525    1.517         1.452
Bootstrap (BCa)    2.155    2.165    2.122         1.944

More evidence supporting the asymmetric correctness
of bootstrap confidence intervals can be found by
studying a system for which the performance measure can
be derived analytically.

• Model 3 - Estimation of the mean service time for an
M/M/1 queueing system when the actual service
times follow an exponential distribution with mean 1.

The $(1-2\alpha)100\%$ exact confidence interval for
the mean service time can be calculated by

$$\left( \frac{2n\bar{x}}{\chi^2_{2n}(1-\alpha)},\ \frac{2n\bar{x}}{\chi^2_{2n}(\alpha)} \right)$$

where $\chi^2_m(\alpha)$ is the $\alpha$th percentile of the chi-square
distribution with $m$ degrees of freedom.
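
The exact interval can be evaluated without statistical tables, because a chi-square distribution with an even number of degrees of freedom $m = 2n$ has the closed-form CDF $1 - e^{-x/2}\sum_{k=0}^{n-1}(x/2)^k/k!$. The Python sketch below inverts that CDF by bisection; the function names and the bisection settings are illustrative choices, not part of the paper.

```python
import math

def chi2_cdf_even(x, m):
    """CDF of the chi-square distribution for even degrees of freedom m."""
    half = x / 2.0
    term, total = 1.0, 1.0          # k = 0 term of the finite sum
    for k in range(1, m // 2):
        term *= half / k            # (x/2)**k / k!
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_percentile_even(p, m):
    """p-th percentile (0 < p < 1) found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, m) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, m) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_interval(xbar, n, alpha=0.05):
    """Exact (1 - 2*alpha) interval for the mean of exponential data."""
    m = 2 * n
    return (2 * n * xbar / chi2_percentile_even(1 - alpha, m),
            2 * n * xbar / chi2_percentile_even(alpha, m))
```

With $\bar{x} = 1$ and $\alpha = 0.05$, `exact_interval(1.0, 5)` gives approximately (0.546, 2.538) and `exact_interval(1.0, 10)` approximately (0.637, 1.843).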

Table 5 contains the average endpoints of the normal
approximation and bootstrap confidence intervals based on
500 simulation runs. The endpoints of the exact intervals are
also included. The exact intervals are asymmetric. The
bootstrap intervals converge to them as $n \to \infty$, with the BCa
intervals being the most accurate. For this model the
coverage probabilities of the bootstrap methods are better than
those of the normal approximation. The sample skewness of the data is
2.091, and asymmetry results are given in Table 6.


Table 5 - Comparison of approximate confidence
intervals for Model 3 versus the exact confidence interval

Sample Size        5               10              20              40
Normal App.        0.174, 1.836    0.463, 1.550    0.625, 1.360    0.740, 1.259
Bootstrap (P)      0.193, 1.855    0.481, 1.567    0.636, 1.372    0.747, 1.266
Bootstrap (BC)     0.279, 1.941    0.539, 1.625    0.671, 1.407    0.766, 1.285
Bootstrap (BCa)    0.366, 2.027    0.579, 1.683    0.705, 1.441    0.785, 1.304
Exact              0.546, 2.538    0.637, 1.843    0.717, 1.509    0.785, 1.325

Table 6 - Asymmetry Results for Model 3

Sample Size        5        10       20       40
Normal App.        1.000    1.000    1.000    1.000
Bootstrap (P)      1.047    1.066    1.065    1.053
Bootstrap (BC)     1.291    1.322    1.287    1.220
Bootstrap (BCa)    1.600    1.650    1.562    1.418

4 Confidence Intervals for Regenerative Simulation

In this section an example of steady-state simulation is
considered.

• Model 4 - M/M/1 queueing system ([5], p.300)
with utilization factor $\rho = 0.8$. The regenerative
method developed by Crane and Iglehart [1] generates
$Y_i$ and $X_i$ for each regenerative cycle, where $Y_i$
represents the total delay in the queue of all customers
served in the $i$th cycle and $X_i$ represents the
total number of customers served in the $i$th cycle.
The performance measure of interest is the steady-state
average delay in the queue, given by
$R = E(Y)/E(X) = 3.2$.

An estimator for $R$ is $\hat{R} = \bar{Y}/\bar{X}$, but $\hat{R}$ is not unbiased.
The jackknife technique can be employed to reduce bias and
to construct a confidence interval for $R$ as follows.

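1. For each $i$, compute $z_i$, where

$$z_i = n\hat{R} - (n-1)\,\frac{\sum_{j \ne i} Y_j}{\sum_{j \ne i} X_j}$$

2. A $(1-2\alpha)100\%$ jackknife confidence interval is
given by

$$\bar{z} \pm t(1-\alpha)\,\frac{\hat{\sigma}_z}{\sqrt{n}}$$

where $\bar{z}$ and $\hat{\sigma}_z$ are the mean and standard deviation of
the $z_i$'s.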
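
The two jackknife steps can be sketched in Python as follows. The pseudo-value formula used is the standard leave-one-cycle-out form, and the $t$ percentile is passed in as a number (`t_crit`) because the standard library has no $t$ distribution; both the function name and that parameter are assumptions for the example.

```python
import math
import statistics

def jackknife_interval(y, x, t_crit):
    """Jackknife interval for R = E(Y)/E(X) from per-cycle totals y and x.

    t_crit: the (1 - alpha) percentile of the t distribution with n - 1
    degrees of freedom, supplied by the caller.
    """
    n = len(y)
    sum_y, sum_x = sum(y), sum(x)
    r_hat = sum_y / sum_x
    # Step 1: pseudo-values, each leaving out one regenerative cycle.
    z = [n * r_hat - (n - 1) * (sum_y - yi) / (sum_x - xi)
         for yi, xi in zip(y, x)]
    # Step 2: t interval built on the pseudo-values.
    z_bar = statistics.fmean(z)
    half_width = t_crit * statistics.stdev(z) / math.sqrt(n)
    return z_bar - half_width, z_bar + half_width
```

For example, with five cycles and a 90% interval one would pass `t_crit = 2.132`, the 95th percentile of the $t$ distribution with 4 degrees of freedom.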

Note that, as an estimator for $R$, $\bar{z}$ has much less bias
than $\hat{R}$. There is no closed-form expression for the bootstrap
confidence interval in the ratio-of-two-means problem.
However, a crude bootstrap procedure can be applied
directly to $(Y_i, X_i)$ to obtain a confidence interval.

Let $d = \{e_1, e_2, \ldots, e_n\}$, with $e_i = (Y_i, X_i)$,
$i = 1, 2, \ldots, n$.

1. Draw $A$ independent bootstrap samples, $d^*_1, \ldots, d^*_A$,
by sampling from $\{e_1, e_2, \ldots, e_n\}$ with replacement;

2. Calculate from $d^*_1, \ldots, d^*_A$ the statistics

$$\bar{R}^* = \frac{1}{A}\sum_{a=1}^{A} R^*_a \quad \text{and} \quad S^2 = \frac{1}{A-1}\sum_{a=1}^{A}\left(R^*_a - \bar{R}^*\right)^2$$

where $R^*_a = \bar{y}^*/\bar{x}^*$ is calculated from $d^*_a$;

3. Draw a bootstrap sample, say, $d^*$ and compute
$R^* = \bar{y}^*/\bar{x}^*$; regarding $d^*$ as the original data, repeat steps 1
and 2 to obtain $S^{*2}$ and $Q^* = (R^* - \hat{R})/S^*$;

4. Repeat step 3 $B$ times, and obtain, say, $Q^*_1, \ldots, Q^*_B$;

5. Sort $Q^*_1, \ldots, Q^*_B$ to construct the bootstrap distribution
$G^*$;

6. Let $z^*_\alpha$ and $z^*_{1-\alpha}$ be respectively the $\alpha$th and $(1-\alpha)$th
percentiles of $G^*$; then $\theta[\alpha] = \hat{R} - z^*_\alpha S$ and $\theta[1-\alpha] =
\hat{R} - z^*_{1-\alpha} S$ are the endpoints of a level $(1-2\alpha)100\%$
confidence interval for $R$.
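
The six steps amount to a nested (double) bootstrap on the standardized ratio. The Python sketch below follows them under stated assumptions: $S$ is taken to be the sample standard deviation of the $A$ inner ratios, resamples whose inner standard deviation is zero are skipped, and the small defaults for `A` and `B` are chosen only to keep the example fast. All names are illustrative.

```python
import random
import statistics

def ratio(sample):
    """R = (sum of Y) / (sum of X) for a list of (Y, X) pairs."""
    return sum(y for y, _ in sample) / sum(x for _, x in sample)

def boot_sd(sample, A, rng):
    """Steps 1-2: standard deviation of the ratio over A resamples."""
    n = len(sample)
    ratios = [ratio(rng.choices(sample, k=n)) for _ in range(A)]
    return statistics.stdev(ratios)

def crude_bootstrap_interval(sample, alpha=0.05, A=25, B=200, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    r_hat = ratio(sample)
    s = boot_sd(sample, A, rng)               # steps 1-2 on the original data
    q = []
    for _ in range(B):                        # steps 3-4
        d_star = rng.choices(sample, k=n)
        s_star = boot_sd(d_star, A, rng)      # inner bootstrap on d*
        if s_star > 0:                        # assumption: skip degenerate cases
            q.append((ratio(d_star) - r_hat) / s_star)
    q.sort()                                  # step 5
    z_lo = q[round(alpha * len(q))]           # step 6: percentiles of G*
    z_hi = q[round((1 - alpha) * len(q)) - 1]
    # Returned as (lower, upper): the upper percentile sets the lower endpoint.
    return r_hat - z_hi * s, r_hat - z_lo * s
```

The cost is roughly $A \times B$ resamples of size $n$, which is why the procedure is described as computationally intensive.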

The rationale behind the crude bootstrap is to use the
bootstrap distribution $G^*$ to approximate the distribution
of $\hat{R}$; the approximation is enhanced by considering the
standardized $\hat{R}$. The precision of the bootstrap intervals
depends on $A$ and $B$, which are 80 and 1000 respectively
in this study. In practice, $A$ in the range of 25 to 100
will give reasonable results. There is little gain in precision
past $A = 100$. A guideline for determining the value of $B$ is
stated in [2] (p.181). Table 7 contains the coverage results
of the two methods based on observations from 200
experiments. The nominal confidence level is 90%. Jackknife
confidence intervals are easier to compute, but crude
bootstrap intervals provide much improved coverage
probability. An interactive program implementing
the bootstrap algorithm described above is available
from the authors.


Table 7 - Estimated Coverage Results for Model 4
(Steady-state M/M/1 Queue)

Sample Size        64            128           256           512
Jackknife          .630 ± .056   .700 ± .053   .770 ± .049   .775 ± .049
Crude Bootstrap    .735 ± .051   .800 ± .046   .825 ± .044   .872 ± .039

5 Conclusion

The purpose of this article is to illustrate the usefulness
of bootstrap methods in constructing confidence intervals
for performance measures in simulations. For
the one-mean problem, the normal approximation and the
bootstrap methods are about equal in actual coverage probability
and in the computation involved. However, only the bootstrap
methods can capture the skewness in the underlying
distribution. Either the BC method or the BCa method can
be recommended in place of the normal approximation.
Computations in the crude bootstrap procedure are intensive
but manageable. Compared with the jackknife method,
the crude bootstrap appears to be the only method that
produces reasonable coverage probabilities for the ratio
of means in the regenerative process.

Abstract

This tutorial links relational database concepts to probability
concepts. For example, the fundamental relational
database concepts of an attribute (column heading),
a relation scheme (unpopulated table), and a relation
(populated table) correspond respectively to the
probability concepts of a random variable, a random vector,
and a multivariate probability distribution. The
relational select and project operators correspond
respectively to finding a conditional and a marginal
distribution. Functional dependencies, multivalued dependencies,
and join dependencies correspond respectively
to variable transformations, conditional independencies,
and more general factorizations of distributions. These
connections indicate that statisticians may know more
about relational databases than they realize. Beyond
these pedagogical benefits, these connections between
relational databases and statistics provide a bridge, both
directions of which have proven to be useful for developing
new theory.

1  Introduction 

This tutorial will cover:

• Relational database concepts and probability parallels
(Section 2).

• An introduction to database normalization theory
(Section 3).

• Parallel theorems for consistent databases and consistent
sets of marginal distributions (Section 4).

• Finding closures of sets of multivalued dependencies
and sets of conditional independencies (Section 5).

• Eliminating intersection anomalies in sets of conditional
independencies and sets of multivalued dependencies
(Section 6).

• Concluding remarks (Section 7).


This  tutorial  will  not  cover 

• Anything about particular relational database management
systems.

•  Network,  hierarchical,  or  object-oriented  database 
models. 

•  Distributed  databases. 

Basic references for relational databases include Codd
(1970), Date (1986), Maier (1983), and Ullman (1982).
More advanced references include Fagin (1977), Fagin,
Mendelzon & Ullman (1982), Beeri, Fagin, Maier & Yannakakis
(1983), and Beeri & Kifer (1986a, b, 1987). Connections
to probability theory are mentioned in Pearl
(1988), Geiger & Pearl (1988, 1990), Geiger, Paz & Pearl
(1991), Lauritzen & Spiegelhalter (1988), and Thoma
(1989).

2  Database  Concepts  and 
Probability  Parallels 

This section defines the basic database concepts and the
parallel probability concepts. The definitions are given
in parallel because familiarity with the probability concepts
might help the reader understand the essential
ideas underlying the database concepts. Also, as Sections
4, 5, and 6 show, there are parallel problems and
results in the two fields.

A relation scheme (table skeleton) $R$ is a set of attributes
(column headings). A relation (table) over relation
scheme $R$ is an indicator function for a set of tuples
(rows), written $r[R]$: $r[R](t) = 1$ if the tuple $t$ is in the
relation; $r[R](t) = 0$ if $t$ is not in the relation. When
storing or writing out a relation, it is common to list
only those tuples that are in the relation (i.e. that have
$r[R](t) = 1$).

The parallel concepts in probability theory are a random
vector and a probability distribution. A random
vector $V$ is a set of random variables. A distribution for
the random vector $V$ is a probability function, written
$p[V]$. The distribution of $V$ evaluated at $v$ is written
$p[V](v)$.

Airline example (Maier, 1983). Relation schedule
contains scheduling information for an airline. Relation
schedule is defined over the relation scheme with attributes
FLT, FROM, TO, DEP, and ARR. The first
tuple in schedule, $t_1$, maps FLT into 84, FROM into
O’Hare, and DEP into 3:00pm. The projection of $t_1$ onto
{FROM, TO} is $t_1$[FROM, TO] = (O’Hare, JFK).


schedule

FLT    FROM       TO             DEP        ARR
84     O’Hare     JFK            3:00pm     5:55pm
109    JFK        Los Angeles    9:40pm     2:42am
117    Atlanta    Boston         10:05pm    12:43am
213    JFK        Boston         11:43am    12:45pm
214    Boston     JFK            2:20pm     3:12pm

The conditional distribution of $V$ given $X = x$, $X \subseteq V$,
based on $p[V]$, written $p[V \mid X = x]$, is the probability
function:

$$p[V \mid X = x](v) = p[V](v)/p[X](x)$$

if $p[X](x) > 0$ and $v[X] = x$; $p[V \mid X = x](v) = 0$
otherwise. The $Y$-margin of the $X = x$ conditional is
written $p[Y \mid X = x]$.

Airline example. The following table shows the
data for flights from JFK.

$\sigma_{\text{FROM}=\text{JFK}}$(schedule)

FLT    FROM    TO             DEP        ARR
109    JFK     Los Angeles    9:40pm     2:42am
213    JFK     Boston         11:43am    12:45pm

The basic operators on relations are a projection of a
relation onto a subset of its attributes, a selection from
a relation of the tuples having a specific value for a subset
of its attributes, and a join of two relations. These
operators correspond to a marginal distribution, a conditional
distribution, and a product of two functions.

The projection of the relation $r[R]$ onto $X \subseteq R$,
written $r[X]$ or $\pi_X(r[R])$, is the indicator function:
$r[X](x) = 1$ if there is a tuple $t$ such that $r[R](t) = 1$
and $t[X] = x$; $r[X](x) = 0$ otherwise.

The marginal distribution of $X \subseteq V$ based on $p[V]$,
written $p[X]$, is found by summing $p[V]$ over the variables
not in $X$; that is, letting $Y = V - X$,

$$p[X](x) = \sum_{y} p[XY](x, y).$$

Airline example. The following tables show the
projections of schedule onto {DEP, ARR} and onto
FROM.

$\pi_{\text{DEP, ARR}}$(schedule)

DEP        ARR
3:00pm     5:55pm
9:40pm     2:42am
10:05pm    12:43am
11:43am    12:45pm
2:20pm     3:12pm

$\pi_{\text{FROM}}$(schedule)

FROM
O’Hare
JFK
Atlanta
Boston


The selection from the relation $r[R]$ of the tuples with
$X = x$, $X \subseteq R$, written $r[R \mid X = x]$ or $\sigma_{X=x}(r[R])$, is
the indicator function: $r[R \mid X = x](t) = 1$ if $r[R](t) = 1$ and $t[X] = x$;
$r[R \mid X = x](t) = 0$ otherwise. The $Y$-projection of the
$X = x$ selection from $r[R]$ is written $r[Y \mid X = x]$.
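
The projection and selection operators can be illustrated on the schedule relation with a few lines of Python. Representing a relation as a list of dictionaries (listing only the tuples with indicator value 1) and the function names `project` and `select` are choices made for this sketch, not notation from the tutorial.

```python
ATTRS = ("FLT", "FROM", "TO", "DEP", "ARR")
schedule = [dict(zip(ATTRS, row)) for row in [
    (84, "O'Hare", "JFK", "3:00pm", "5:55pm"),
    (109, "JFK", "Los Angeles", "9:40pm", "2:42am"),
    (117, "Atlanta", "Boston", "10:05pm", "12:43am"),
    (213, "JFK", "Boston", "11:43am", "12:45pm"),
    (214, "Boston", "JFK", "2:20pm", "3:12pm"),
]]

def project(rows, attrs):
    """Projection: the set of distinct sub-tuples over `attrs`."""
    return {tuple(row[a] for a in attrs) for row in rows}

def select(rows, **conditions):
    """Selection: the tuples whose named attributes equal the given values."""
    return [row for row in rows
            if all(row[a] == v for a, v in conditions.items())]

from_jfk = select(schedule, FROM="JFK")       # flights 109 and 213
origins = project(schedule, ["FROM"])         # four distinct airports
```

Projection drops duplicate sub-tuples, just as a marginal distribution pools the probability of outcomes that agree on the retained variables.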


Let $r_1[R_1]$ and $r_2[R_2]$ be relations over relation
schemes $R_1$ and $R_2$. Let $X = R_1 - R_2$, $Y = R_1 \cap R_2$,
$Z = R_2 - R_1$. The join of $r_1[R_1]$ and $r_2[R_2]$ is the relation
over $R_1 \cup R_2 = XYZ$ ($XYZ$ is shorthand for $X \cup Y \cup Z$)
defined by

$$(r_1 \bowtie r_2)[XYZ](x, y, z) = r_1[XY](x, y)\; r_2[YZ](y, z).$$

Let $h_1[V_1]$ and $h_2[V_2]$ be functions over variable sets
$V_1$ and $V_2$. Let $X = V_1 - V_2$, $Y = V_1 \cap V_2$, $Z = V_2 - V_1$.
The product of $h_1[V_1]$ and $h_2[V_2]$ is the function over
$V_1 \cup V_2 = XYZ$ defined by

$$(h_1 \otimes h_2)[XYZ](x, y, z) = h_1[XY](x, y)\; h_2[YZ](y, z).$$

Airline example. Relation usable contains the
equipment requirements for each flight. Relation
certified contains the equipment qualifications for each
pilot. Suppose we want to know the pilots that can fly
each of the flights. To find the answer to this query,
we first form options = usable $\bowtie$ certified. Then we
project options onto FLT and PILOT, providing the answer
to the original query.


usable                certified

FLT    EQPMT          PILOT      EQPMT
83     727            Simmons    707
83     747            Simmons    727
84     727            Barth      747
84     747            Hill       727
109    707            Hill       747

options = usable $\bowtie$ certified

FLT    EQPMT    PILOT
83     727      Simmons
83     727      Hill
83     747      Barth
83     747      Hill
84     727      Simmons
84     727      Hill
84     747      Barth
84     747      Hill
109    707      Simmons

$\pi_{\text{FLT, PILOT}}$(options)

FLT    PILOT
83     Simmons
83     Hill
83     Barth
84     Simmons
84     Hill
84     Barth
109    Simmons
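
The query in this example can be reproduced with a small natural-join sketch in Python. The list-of-dictionaries representation and the function names `natural_join` and `project` are assumptions made for illustration.

```python
usable = [dict(FLT=f, EQPMT=e) for f, e in
          [(83, 727), (83, 747), (84, 727), (84, 747), (109, 707)]]
certified = [dict(PILOT=p, EQPMT=e) for p, e in
             [("Simmons", 707), ("Simmons", 727), ("Barth", 747),
              ("Hill", 727), ("Hill", 747)]]

def natural_join(r1, r2):
    """Pair up tuples that agree on every attribute the relations share."""
    joined = []
    for t1 in r1:
        for t2 in r2:
            shared = t1.keys() & t2.keys()
            if all(t1[a] == t2[a] for a in shared):
                joined.append({**t1, **t2})
    return joined

def project(rows, attrs):
    """Set of distinct sub-tuples over `attrs`."""
    return {tuple(row[a] for a in attrs) for row in rows}

options = natural_join(usable, certified)
answer = project(options, ["FLT", "PILOT"])
```

The nested loop mirrors the product of indicator functions in the definition: a combined tuple is kept exactly when both factors equal 1.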

Table 1 summarizes the basic database/probability
parallels covered to this point.

Database design concepts involve putting constraints
on the data that can populate a table. There are three
basic kinds of constraints: a functional dependency, a
multivalued dependency, and a join dependency. These
correspond to three constraints on probability distributions:
transformation constraints, conditional independencies,
and general factorization constraints.

A relation $r[R]$ satisfies the functional dependency FD:
$X \to Y$ if for each $X$-value $x$ with $r[X](x) = 1$, there
is a unique $Y$-value $y_x$ such that $r[Y \mid X = x](y) = 1$ if
$y = y_x$ and $r[Y \mid X = x](y) = 0$ otherwise.

A distribution $p[V]$ satisfies the transformation constraint
TC: $X \to Y$ if for each $X$-value $x$ with $p[X](x) > 0$,
there is a unique $Y$-value $y_x$ such that $p[Y \mid X = x](y) = 1$
if $y = y_x$ and $p[Y \mid X = x](y) = 0$ otherwise.

Airline example. The relation schedule satisfies
the FD FLT $\to$ {FROM, TO, DEP, ARR}. The FLT-value
of a tuple uniquely determines the rest of the
tuple. The relation schedule does not satisfy the FD
FROM $\to$ TO because $t_2$[FROM] = $t_4$[FROM] = JFK,
but $t_2$[TO] = Los Angeles $\ne$ Boston = $t_4$[TO].
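
A functional dependency can be checked mechanically by recording the first $Y$-value seen for each $X$-value and rejecting any later disagreement. The Python sketch below does this for the schedule relation; the function name `satisfies_fd` and the dictionary representation are illustrative assumptions.

```python
ATTRS = ("FLT", "FROM", "TO", "DEP", "ARR")
schedule = [dict(zip(ATTRS, row)) for row in [
    (84, "O'Hare", "JFK", "3:00pm", "5:55pm"),
    (109, "JFK", "Los Angeles", "9:40pm", "2:42am"),
    (117, "Atlanta", "Boston", "10:05pm", "12:43am"),
    (213, "JFK", "Boston", "11:43am", "12:45pm"),
    (214, "Boston", "JFK", "2:20pm", "3:12pm"),
]]

def satisfies_fd(rows, lhs, rhs):
    """True if tuples that agree on `lhs` always agree on `rhs`."""
    image = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y on first sight of x, then returns the stored value.
        if image.setdefault(x, y) != y:
            return False
    return True

flt_is_key = satisfies_fd(schedule, ["FLT"], ["FROM", "TO", "DEP", "ARR"])
from_determines_to = satisfies_fd(schedule, ["FROM"], ["TO"])
```

Note that this tests one relation instance only; whether the dependency holds for the relation scheme is a statement about all relations over it.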

A random vector $V$ satisfies a constraint if all distributions
for $V$ must satisfy the constraint. Likewise, a
relation scheme $R$ satisfies a data dependency if all relations
over $R$ must satisfy the dependency.

Table 1: Basic database and probability parallels.

DATABASE CONCEPT                         PROBABILITY CONCEPT

Relation scheme (table skeleton) $R$,    Random vector $V$,
a set of attributes (column names)       a set of random variables

Relation (table) over $R$, $r[R]$,       Distribution for $V$, $p[V]$,
an indicator function                    a probability function
for a set of tuples (rows)

Projection of $r[R]$                     Marginal distribution
onto $X \subseteq R$,                    of $X \subseteq V$,
$\pi_X(r[R])$ or $r[X]$                  $p[X]$

Selection $\sigma_{X=x}(r[R])$           Conditional distribution
or $r[R \mid X = x]$,                    $p[V \mid X = x]$,
$X \subseteq R$                          $X \subseteq V$

Join of 2 relations                      Product of 2 functions
$r_1[R_1] \bowtie r_2[R_2]$              $h_1[V_1] \otimes h_2[V_2]$

Airline example. The functional dependency
FLT $\to$ {FROM, TO, DEP, ARR} remains true over
time. As a result, FLT is a candidate key for the relation
schedule.

A relation $r[R]$ satisfies the multivalued dependency
MVD: $Z \twoheadrightarrow X \mid Y$ if

$$r[XYZ](x_1, y_1, z)\; r[XYZ](x_2, y_2, z) = r[XYZ](x_1, y_2, z)\; r[XYZ](x_2, y_1, z).$$

Similarly, a distribution $p[V]$ satisfies the conditional
independency CI: $X \perp\!\!\!\perp Y \mid Z$ if

$$p[XYZ](x_1, y_1, z)\; p[XYZ](x_2, y_2, z) = p[XYZ](x_1, y_2, z)\; p[XYZ](x_2, y_1, z).$$

Multivalued dependencies are equivalent to binary join
dependencies. That is, a relation satisfies an MVD iff
it can be recovered as the join of two relations defined
on “smaller” relation schemes. In symbols, a relation
$r[R]$ satisfies MVD: $Z \twoheadrightarrow X \mid Y$ iff there exist relations
$r_1[XZ]$ and $r_2[YZ]$ such that $r = r_1 \bowtie r_2$; that is,

$$r[XYZ](x, y, z) = r_1[XZ](x, z)\; r_2[YZ](y, z).$$

If such $r_j$'s exist, then $r[R]$ is said to satisfy the binary
join dependency BJD: $\bowtie \{XZ, YZ\}$. Also, if such $r_j$'s
exist, then they can be taken to be $r_1[XZ] = r[XZ]$ and
$r_2[YZ] = r[YZ]$.

Similarly, conditional independencies are equivalent to
binary factorization constraints. That is, a probability
distribution satisfies a CI iff it can be recovered as the
product of two functions defined on “smaller” random
vectors. In symbols, a distribution $p[V]$ satisfies CI:
$X \perp\!\!\!\perp Y \mid Z$ iff there exist nonnegative functions $h_1[XZ]$
and $h_2[YZ]$ such that $p = h_1 \otimes h_2$; that is,

$$p[XYZ](x, y, z) = h_1[XZ](x, z)\; h_2[YZ](y, z).$$

If such $h_j$'s exist, then $p[V]$ is said to satisfy the binary
factorization constraint BFC: $\otimes \{XZ, YZ\}$.

Airline example. Relation service satisfies the
MVD: FLT $\twoheadrightarrow$ DAY OF WEEK | PLANE TYPE because
service = servday $\bowtie$ servtype, where servday =
service[FLT, DAY OF WEEK] and servtype =
service[FLT, PLANE TYPE]. Relation service2,
which has the same two projections as service, does not
satisfy this MVD because it lacks the tuple (106, Thursday,
1011).


service

FLT    DAY OF WEEK    PLANE TYPE
106    Monday         747
106    Thursday       747
106    Monday         1011
106    Thursday       1011
204    Wednesday      707
204    Wednesday      727

servday = $\pi_{\text{FLT, DAY OF WEEK}}$(service)

FLT    DAY OF WEEK
106    Monday
106    Thursday
204    Wednesday

servtype = $\pi_{\text{FLT, PLANE TYPE}}$(service)

FLT    PLANE TYPE
106    747
106    1011
204    707
204    727

service2

FLT    DAY OF WEEK    PLANE TYPE
106    Monday         747
106    Thursday       747
106    Monday         1011
204    Wednesday      707
204    Wednesday      727

A relation $r[R]$ satisfies the join dependency JD: $\bowtie \mathcal{R}$,
$\mathcal{R} = \{R_1, \ldots, R_k\}$, $R_j \subseteq R$, if there exist relations $r_1[R_1]$,
..., $r_k[R_k]$ such that

$$r[R] = r_1[R_1] \bowtie \cdots \bowtie r_k[R_k].$$

If such $r_j$'s exist, then they can be taken to be $r_j[R_j] =
r[R_j]$, $j = 1, \ldots, k$. The set of relation schemes is a set
of subsets of $R$; in other words, $\mathcal{R}$ is a hypergraph over
$R$.

A distribution $p[V]$ satisfies the factorization constraint
FC: $\otimes \mathcal{V}$, $\mathcal{V} = \{V_1, \ldots, V_k\}$, $V_j \subseteq V$, if there exist
nonnegative functions $h_1[V_1], \ldots, h_k[V_k]$ such that

$$p[V] = h_1[V_1] \otimes \cdots \otimes h_k[V_k].$$

The set of margins $\mathcal{V}$ is a hypergraph over $V$. Factorization
constraints generalize loglinear models, which must
be strictly positive.

Example. The relation $r[ABC]$ satisfies the JD: $\bowtie$
{AB, BC, AC} but does not satisfy any nontrivial MVD.

r[ABC]       r[AB]     r[BC]     r[AC]

A  B  C      A  B      B  C      A  C
1  1  1      1  1      1  1      1  1
1  2  2      1  2      2  2      1  2
2  3  3      2  3      3  3      2  3
3  3  4      3  3      3  4      3  4
4  4  5      4  4      4  5      4  5
5  5  5      5  5      5  5      5  5

Finally, databases correspond to sets of marginal
distributions.

A database scheme over attribute set $R$ is a set of relation
schemes with attributes from $R$: $\mathcal{R} = \{R_1, \ldots, R_k\}$,
$R_j \subseteq R$. The database scheme $\mathcal{R}$ is a hypergraph
over $R$. A database over database scheme $\mathcal{R}$ is a set
of relations over the relation schemes in $\mathcal{R}$: $r[\mathcal{R}] =
\{r_1[R_1], \ldots, r_k[R_k]\}$.

A set of margins of random vector $V$ is a set of random
vectors with variables from $V$: $\mathcal{V} = \{V_1, \ldots, V_k\}$, $V_j \subseteq V$.
The set of margins $\mathcal{V}$ is a hypergraph over $V$. A set of
marginals over the set of margins $\mathcal{V}$ is a set of distributions
for the margins in $\mathcal{V}$: $p[\mathcal{V}] = \{p_1[V_1], \ldots, p_k[V_k]\}$.
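
Since a relation satisfies a join dependency exactly when it equals the join of its own projections, both the MVD claims for service and service2 and the JD claim for $r[ABC]$ can be checked by brute force. The Python sketch below does so; it assumes the schemes together cover every attribute, and the names `project` and `satisfies_jd` are illustrative.

```python
from itertools import product

def project(rows, attrs, scheme):
    """Projection of a set of tuples over `attrs` onto the sub-scheme."""
    idx = [attrs.index(a) for a in scheme]
    return {tuple(row[i] for i in idx) for row in rows}

def satisfies_jd(rows, attrs, schemes):
    """True if `rows` equals the join of its projections onto `schemes`."""
    rows = set(rows)
    projections = [(s, project(rows, attrs, s)) for s in schemes]
    # Candidate tuples: every combination of values seen in each column.
    domains = [{row[i] for row in rows} for i in range(len(attrs))]
    joined = {
        cand for cand in product(*domains)
        if all(tuple(cand[attrs.index(a)] for a in s) in proj
               for s, proj in projections)
    }
    return joined == rows

FDT = ("FLT", "DAY", "TYPE")
service = [(106, "Monday", 747), (106, "Thursday", 747),
           (106, "Monday", 1011), (106, "Thursday", 1011),
           (204, "Wednesday", 707), (204, "Wednesday", 727)]
service2 = [t for t in service if t != (106, "Thursday", 1011)]
binary = [("FLT", "DAY"), ("FLT", "TYPE")]

ABC = ("A", "B", "C")
r_abc = [(1, 1, 1), (1, 2, 2), (2, 3, 3), (3, 3, 4), (4, 4, 5), (5, 5, 5)]
```

Here `satisfies_jd(service, FDT, binary)` holds while `satisfies_jd(service2, FDT, binary)` does not, and `r_abc` passes the three-way test for {AB, BC, AC} but fails each of the three binary ones.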

Table  2  summarizes  this  second  collection  of  parallels. 
There  are  many  more  parallels  between  database  theory 
and  probability.  Sections  4,  5,  and  6  discuss,  very  briefly, 
three  parallel  problems  and  solutions. 


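Table 2: Further database and probability parallels.

DATABASE CONCEPT                                  PROBABILITY CONCEPT

Functional dependency                             Variable transformation
$X \to Y$                                         $X \to Y$
$X, Y \subseteq R$                                $X, Y \subseteq V$

Multivalued dependency                            Conditional independency
$Z \twoheadrightarrow X \mid Y$                   $X \perp\!\!\!\perp Y \mid Z$
$X, Y, Z \subseteq R$                             $X, Y, Z \subseteq V$

Join dependency                                   Factorization constraint
$\bowtie \mathcal{R}$                             $\otimes \mathcal{V}$
$\mathcal{R} = \{R_1, \ldots, R_k\}$, $R_j \subseteq R$    $\mathcal{V} = \{V_1, \ldots, V_k\}$, $V_j \subseteq V$

MVDs are binary JDs                               CIs are binary FCs
$\bowtie \{XZ, YZ\}$                              $\otimes \{XZ, YZ\}$

Database scheme over $R$                          Set of margins of $V$
$\mathcal{R} = \{R_1, \ldots, R_k\}$, $R_j \subseteq R$    $\mathcal{V} = \{V_1, \ldots, V_k\}$, $V_j \subseteq V$

Database over $\mathcal{R}$                       Set of marginals on $\mathcal{V}$
$r[\mathcal{R}] = \{r_1[R_1], \ldots, r_k[R_k]\}$          $p[\mathcal{V}] = \{p_1[V_1], \ldots, p_k[V_k]\}$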

3  A  Brief  Introduction  to 
Normalization  Theory 

Here is a small sample of normalization theory, an important
standard topic in database theory with no useful parallels in
probability theory. The basic reason for normalizing a database is to
automatically eliminate possible inconsistencies that might otherwise
arise.

A set of attributes $K$ is a candidate key of $R$ if $K \to R$.
One of the candidate keys of relation $R$ is designated the
primary key and the other attributes are called non-keys.

A set of attributes $Y$ is fully dependent on another set
of attributes $X$ if $X \to Y$ and there is no $Z \subset X$ such
that $Z \to Y$. If there is such a $Z$ then $Y$ is partially
dependent on $X$.

A set of attributes $Z$ is transitively dependent on $X$
if there is a $Y$ such that $X \to Y$ and $Y \to Z$.
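
As an illustration of these definitions, the following sketch tests
functional dependencies against a single relation instance. This is an
assumption-laden simplification: in normalization theory dependencies
are constraints on the scheme, whereas the code only checks whether one
particular set of tuples happens to satisfy them. The function names
and the sample data (student S, course C, grade G, student name N) are
hypothetical.

```python
from itertools import combinations

def satisfies_fd(relation, attrs, lhs, rhs):
    """True if this relation instance satisfies lhs -> rhs."""
    li = [attrs.index(a) for a in lhs]
    ri = [attrs.index(a) for a in rhs]
    seen = {}
    for t in relation:
        key = tuple(t[i] for i in li)
        val = tuple(t[i] for i in ri)
        # The same lhs value must always map to the same rhs value.
        if seen.setdefault(key, val) != val:
            return False
    return True

def is_fully_dependent(relation, attrs, lhs, rhs):
    """lhs -> rhs holds and no proper subset Z of lhs has Z -> rhs."""
    if not satisfies_fd(relation, attrs, lhs, rhs):
        return False
    for size in range(len(lhs)):          # proper subsets only
        for z in combinations(lhs, size):
            if satisfies_fd(relation, attrs, z, rhs):
                return False
    return True

attrs = "SCGN"
rel = {
    ("s1", "c1", "A", "Ann"),
    ("s1", "c2", "B", "Ann"),
    ("s2", "c1", "B", "Bob"),
}
```

Here G is fully dependent on SC, while N is only partially dependent on
SC because S alone already determines N.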

The  normal  forms  are: 

•  First Normal Form (1NF): A relation is in 1NF if
all the values in its tuples are atomic. There are no
repeating groups.

•  Second Normal Form (2NF): A relation is in 2NF if
it is in 1NF and every non-key is fully dependent on
the primary key. A relation in 2NF has no partial
dependencies.

•  Third Normal Form (3NF): A relation is in 3NF if
it is in 2NF and no non-key is transitively dependent
on the primary key. A relation in 3NF has no
partial or transitive dependencies. All the non-keys
in a 3NF relation are mutually independent (i.e. no
non-key is functionally dependent on another
non-key).

•  Boyce/Codd  Normal  Form  (BCNF):  A  relation  is  in 
BCNF  if  every  FD  is  a  consequence  of  the  candidate 
keys.  Date:  “Each  field  must  represent  a  fact  about 
the  key,  the  whole  key,  and  nothing  but  the  key.” 

•  Fourth Normal Form (4NF): A relation is in 4NF if
every MVD is a consequence of the candidate keys.
All dependencies (MVDs and FDs) of a 4NF relation
are FDs from a candidate key to another attribute.
A relation is in 4NF if it is in BCNF and all its
MVDs are FDs.

•  Fifth  Normal  Form  (5NF):  A  relation  is  in  5NF  if 
every  JD  is  a  consequence  of  the  candidate  keys. 
5NF  is  also  called  project/join  normal  form. 

There are rules for converting database schemes that
do not satisfy normal forms into ones that do. The
interested reader should consult Maier (1983) or Ullman
(1982), for example.

4  Parallel Theorems for Consistent
Databases and Consistent
Sets of Marginal Distributions

It was noted earlier that database schemes and sets of
margins are hypergraphs. There are strong connections
between relational databases and graph theory and
between probability theory and graph theory. Often,
properties of databases and properties of probability
distributions are determined by the underlying graphical
structure. This section gives an example of the kind of parallel
results that arise because of these connections to graph
theory.

A database $r[\mathcal{R}]$ is pairwise consistent if
$r_i[R_i \cap R_j] = r_j[R_i \cap R_j]$ for all $i, j$. A database
$r[\mathcal{R}]$ is globally consistent if there exists a single
relation $r[R]$ such that $r_j[R_j] = r[R_j]$ for $j = 1, \ldots, k$; if
such an $r[R]$ exists, then it can be taken to be
$r[R] = r_1[R_1] \bowtie \cdots \bowtie r_k[R_k]$.

A set of marginals $p[\mathcal{V}]$ is pairwise consistent if
$p_i[V_i \cap V_j] = p_j[V_i \cap V_j]$. A set of marginals
$p[\mathcal{V}]$ is globally consistent (or extendable) if there exists
a single distribution $p[V]$ such that $p_j[V_j] = p[V_j]$.

Consider  the  following  two  examples. 

Example 1 (Vorob'ev, 1962). Let $\mathcal{V} = \{AB, BC, AC\}$
be a set of margins of the random vector $ABC$. Let
$p = \{p_1, p_2, p_3\}$ be the set of marginals over $\mathcal{V}$
defined by

$p_1[AB](0,0) = p_1[AB](1,1) = 1/2$,

$p_2[BC](1,0) = p_2[BC](0,1) = 1/2$,

and

$p_3[AC](0,0) = p_3[AC](1,1) = 1/2$.

There is no distribution $p[ABC]$ such that $p[AB] = p_1[AB]$,
$p[BC] = p_2[BC]$, and $p[AC] = p_3[AC]$. Such a $p[ABC]$ would have
$p[ABC](0,0,0) = 0$ because $p_2[BC](0,0) = 0$, and
$p[ABC](0,0,1) = 0$ because $p_3[AC](0,1) = 0$, so $p[AB](0,0) = 0$,
contradicting $p_1[AB](0,0) = 1/2$. This same example can be given as a
database example with $p$'s replaced by $r$'s and 1/2's replaced
by 1's.
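
The database version of Example 1 can be checked mechanically. The
sketch below is a minimal illustration, assuming relations are stored
as sets of tuples keyed by a scheme name such as "AB"; the helper names
are hypothetical. Global consistency is tested by forming the natural
join of all relations and projecting it back, which relies on the
statement above that a witnessing relation, if one exists, can be taken
to be the join.

```python
from itertools import combinations

def project(relation, attrs, onto):
    """Projection of a relation over `attrs` onto the attributes `onto`."""
    idx = [attrs.index(a) for a in onto]
    return {tuple(t[i] for i in idx) for t in relation}

def natural_join(r1, attrs1, r2, attrs2):
    """Natural join of two relations; returns (relation, attribute string)."""
    common = [a for a in attrs1 if a in attrs2]
    extra = [a for a in attrs2 if a not in attrs1]
    out = set()
    for t1 in r1:
        for t2 in r2:
            if all(t1[attrs1.index(a)] == t2[attrs2.index(a)] for a in common):
                out.add(t1 + tuple(t2[attrs2.index(a)] for a in extra))
    return out, attrs1 + "".join(extra)

def pairwise_consistent(db):
    """Every two relations agree on the projection onto their shared attributes."""
    for s, t in combinations(db, 2):
        common = "".join(a for a in s if a in t)
        if project(db[s], s, common) != project(db[t], t, common):
            return False
    return True

def globally_consistent(db):
    """The join of all relations projects back onto each relation."""
    schemes = list(db)
    joined, attrs = db[schemes[0]], schemes[0]
    for s in schemes[1:]:
        joined, attrs = natural_join(joined, attrs, db[s], s)
    return all(project(joined, attrs, s) == db[s] for s in schemes)

# Example 1 with 1/2's replaced by 1's (tuples that are present).
vorobev = {
    "AB": {(0, 0), (1, 1)},
    "BC": {(1, 0), (0, 1)},
    "AC": {(0, 0), (1, 1)},
}
```

For this database the three relations agree pairwise on A, B and C, yet
their join is empty, so no single relation over ABC has them as its
projections.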

Example 2. Let $\mathcal{R} = \{ABD, BCD, BCE\}$ be a
database scheme over $ABCDE$. For every pairwise consistent
database $r = \{r_1, r_2, r_3\}$ over $\mathcal{R}$, there is a single
relation $r[ABCDE]$ such that $r[ABD] = r_1[ABD]$,
$r[BCD] = r_2[BCD]$, and $r[BCE] = r_3[BCE]$. The parallel
statement holds for probability distributions.

The difference between these examples is that the
hypergraph in Example 2 is acyclic but the one in Example
1 is not acyclic. There are many ways to define an acyclic
hypergraph. The following definition, referred to as the
running intersection property, does not require definitions
for any other concepts. A hypergraph $H$ is acyclic
if its elements can be ordered $H_1, \ldots, H_k$ so that for each
$i = 2, \ldots, k$ there is a $j < i$ with

$H_i \cap (H_1 \cup \cdots \cup H_{i-1}) \subseteq H_j$.
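
The running intersection property can be tested directly on small
hypergraphs. The following sketch is a brute-force illustration (it
tries every ordering, so it is only practical for a handful of
hyperedges); the function names are assumptions made for this example.

```python
from itertools import permutations

def has_running_intersection(ordering):
    """Check the property for one fixed ordering H_1, ..., H_k."""
    for i in range(1, len(ordering)):
        earlier = set().union(*ordering[:i])       # H_1 u ... u H_{i-1}
        separator = ordering[i] & earlier
        if not any(separator <= ordering[j] for j in range(i)):
            return False
    return True

def is_acyclic(hypergraph):
    """True if some ordering of the hyperedges has the property."""
    edges = [frozenset(e) for e in hypergraph]
    return any(has_running_intersection(list(p)) for p in permutations(edges))

example_1 = ["AB", "BC", "AC"]       # margins of Example 1
example_2 = ["ABD", "BCD", "BCE"]    # database scheme of Example 2
```

With these inputs `is_acyclic(example_2)` holds, using for instance the
ordering ABD, BCD, BCE, while `is_acyclic(example_1)` does not, since
the last edge in any ordering meets the union of the other two in a
pair of attributes that neither of them contains.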

The  two  results  can  now  be  stated. 

Vorob'ev (1962) proved that every pairwise consistent
set of marginals over a set of margins $\mathcal{V}$ is extendable
if and only if the hypergraph $\mathcal{V}$ is acyclic (see also
Lauritzen, Speed & Vijayan, 1984).

Beeri, Fagin, Maier & Yannakakis (1983) proved the
parallel result for relational databases: that is, every
pairwise consistent database over a database scheme $\mathcal{R}$
is globally consistent if and only if the hypergraph $\mathcal{R}$ is
acyclic.

5  Closures  of  Sets  of  MVDs  and 
Sets  of  CIs 

Let  M  be  a  set  of  MVDs  over  R.  The  closure  M*  of  M 
is  the  set  of  MVDs  implied  by  the  MVDs  in  M,  that  is,  if 
a  relation  satisfies  the  MVDs  in  M,  then  it  also  satisfies 
the  MVDs  in  M*. 

The closure $M^*$ of $M$ can be found as follows. Let
$E_M(X) = \{Y \subseteq R - X : X \to\to Y \in M^*\}$. The
dependency basis of $X$, $\mathrm{DEP}_M(X)$, is the partition of
$R - X$ such that $Y \in E_M(X)$ iff $Y$ is a union of sets in
$\mathrm{DEP}_M(X)$. $\mathrm{DEP}_M(X)$ can be found using the
following algorithm:

(0)  Start with partition $\mathcal{P} = \{R - X\}$.

(1)  If $Y \in \mathcal{P}$ and there is an MVD $Z \to\to W$ in $M$ such
that $Y \cap Z = \emptyset$, then replace $Y$ by the 2 sets $Y \cap W$
and $Y - W$.

(2)  Repeat (1) until it no longer changes $\mathcal{P}$.

The final partition is $\mathrm{DEP}_M(X)$.

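Example. Let $M = \{BC \to\to AD \mid E,\ BD \to\to A \mid CE\}$.
To find $\mathrm{DEP}_M(BCD)$, (0) let $\mathcal{P} = \{AE\}$, (1)
$AE \in \mathcal{P}$, $BC \to\to AD \mid E \in M$, and
$BC \cap AE = \emptyset$, so replacing $AE$ by $AE \cap AD = A$ and
$AE \cap E = E$ gives $\mathrm{DEP}_M(BCD) = \{A, E\}$.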
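
The refinement algorithm above can be sketched in a few lines. The
code below is an illustration under stated assumptions: each MVD
$X \to\to Y \mid Z$ is entered as the two pairs (X, Y) and (X, Z), the
function name is hypothetical, and a block is split only when both
resulting pieces are non-empty (a guard added here so that the loop
terminates).

```python
def dependency_basis(universe, x, mvds):
    """Refine {R - X} using MVDs given as (lhs, rhs) pairs."""
    x = frozenset(x)
    partition = {frozenset(universe) - x}
    partition.discard(frozenset())
    changed = True
    while changed:                       # step (2): repeat until stable
        changed = False
        for lhs, rhs in mvds:
            lhs, rhs = frozenset(lhs), frozenset(rhs)
            for block in list(partition):
                inside, outside = block & rhs, block - rhs
                # Step (1): the block must be disjoint from the left-hand side.
                if not (block & lhs) and inside and outside:
                    partition.remove(block)
                    partition.update({inside, outside})
                    changed = True
    return partition

# M = {BC ->> AD | E, BD ->> A | CE} over R = ABCDE.
M = [("BC", "AD"), ("BC", "E"), ("BD", "A"), ("BD", "CE")]
basis = dependency_basis("ABCDE", "BCD", M)
```

For the example in the text, `basis` contains the two singleton blocks
A and E.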

Geiger  &  Pearl  (1988,  1990)  and  Geiger,  Paz  &  Pearl 
(1991)  proved  that  the  same  algorithm  can  be  used  to 
find  the  closure  of  a  set  of  conditional  independencies 
with  respect  to  arbitrary  (i.e.  not  necessarily  strictly 
positive)  distributions.  They  also  derived  a  graph-based 
approach  for  finding  the  closure  with  respect  to  strictly 
positive  distributions. 

6  Eliminating Intersection
Anomalies

The two CIs $X \perp\!\!\!\perp Y \mid Z$ and
$X \perp\!\!\!\perp Z \mid Y$ imply the third
$X \perp\!\!\!\perp YZ$ for strictly positive distributions. The same is
not true for arbitrary distributions. For example, the
distribution $p[XYZ](0,0,0) = p[XYZ](1,1,1) = 1/2$,
$p[XYZ](x,y,z) = 0$ otherwise, satisfies the first two of
these CIs, but does not satisfy the third. The set of CIs
$\{X \perp\!\!\!\perp Y \mid Z,\ X \perp\!\!\!\perp Z \mid Y\}$ is said
to have an intersection anomaly.
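
The counterexample can be verified numerically. The sketch below
checks a conditional independency by testing the product identity
p(a,b,c) p(c) = p(a,c) p(b,c) for every combination of values, which
avoids dividing by zero probabilities. The helper names are
hypothetical, variables are referred to by coordinate position, and all
variables are assumed to share the value set found in the support.

```python
from fractions import Fraction
from itertools import product

def marginal(p, keep):
    """Marginal of p (dict: tuple -> probability) on coordinates `keep`."""
    out = {}
    for point, prob in p.items():
        key = tuple(point[i] for i in keep)
        out[key] = out.get(key, 0) + prob
    return out

def cond_indep(p, a, b, c=()):
    """True if coordinates a and b are independent given coordinates c."""
    p_abc, p_ac = marginal(p, a + b + c), marginal(p, a + c)
    p_bc, p_c = marginal(p, b + c), marginal(p, c)
    values = sorted({v for point in p for v in point})
    for va in product(values, repeat=len(a)):
        for vb in product(values, repeat=len(b)):
            for vc in product(values, repeat=len(c)):
                lhs = p_abc.get(va + vb + vc, 0) * p_c.get(vc, 0)
                rhs = p_ac.get(va + vc, 0) * p_bc.get(vb + vc, 0)
                if lhs != rhs:
                    return False
    return True

X, Y, Z = (0,), (1,), (2,)
anomaly = {(0, 0, 0): Fraction(1, 2), (1, 1, 1): Fraction(1, 2)}
uniform = {pt: Fraction(1, 8) for pt in product((0, 1), repeat=3)}
```

For `anomaly` the first two independencies hold and the third fails;
for the strictly positive `uniform` distribution all three hold.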

After reviewing several statistical arguments that
were flawed because they ignored intersection anomalies,
Dawid (1979) showed that it is possible to fix up
this anomaly by adding a variable $W$ such that $W$ is
functionally determined by each of $Y$ and $Z$ individually
(i.e. $Y \to W$, $Z \to W$) and $X \perp\!\!\!\perp YZ \mid W$. The
variable $W$ represents the information that $Y$ and $Z$ have in
common.

Beeri and Kifer (1986a, b, 1987) and others have written
extensively about the same issue for sets of MVDs.
Their solution, which has implications for database
design, is the same as Dawid's. They only apply the
method to sets of MVDs that do not have split left hand
sides, so after eliminating intersection anomalies they
have a conflict-free set of MVDs which is equivalent to a
single (acyclic) JD.

7  Concluding  Remarks 

This tutorial reviewed basic parallels between database
theory and probability theory. It discussed three
parallel problems and corresponding solutions in the two
areas. It mentioned some of the connections to graph
theory which provide another bridge between results in
database theory and those in probability theory. For
example, acyclic databases and decomposable models
(distributions that satisfy acyclic factorization constraints)
have many desirable properties (Beeri, Fagin, Maier &
Yannakakis, 1983; Darroch, Lauritzen & Speed, 1980).

One  particularly  interesting  connection  concerns  the 
positivity  condition  of  the  Gibbs-Markov  equivalence 
theorem.  It  is  possible  to  relax  the  positivity  condition 
using  concepts  from  relational  database  theory.  Results 
on  this  topic  and  others  will  be  given  in  future  papers. 

Abstract

A  new  form  of  regression  is  applied  to  the  problem  of 
modeling  the  flow  of  water  and  contaminants  through 
soil.  In  a  fashion  analogous  to  nested  ANOVA,  the 
new  method  parametrizes  global  distributional 
structure  separately  from  local  structure.  A  blind 
study  is  conducted  to  assess  the  precision  of  mixing 
parameter  estimation  as  a  function  of  depth.  It  is 
shown  that  accurate  estimates  of  the  regression 
relationship  can  be  obtained  from  a  sample  of  size 
n=1000  for  mixing  parameters  and  all  other 
component  parameters,  with  the  exception  of  the 
standard  deviation  of  small  components  which  have 
large  variances. 

It is shown that the hydraulic conductivity, transport,
or infiltration of water-borne contaminants through
the vadose zone can be effectively modeled and
simulated by the mixing parameter regression
methods.

1.  Introduction

This  paper  concerns  the  description  of  soil 
characteristics  by  means  of  a  new  type  of  regression. 
The  value  of  this  form  of  regression  stems  from  its 
capacity  to  separate  global  from  local  variability 
through  the  use  of  interactive  graphical  analysis. 

As discussed by Wagenet (1986, p. 340):

It  appears  that  a  stochastic,  rather  than  a 
deterministic,  model  approach  should  be 
considered  when  modeling  water  and 
chemical  movement  in  the  unsaturated  zone. 

This  will  represent  no  small  change  in  our 
conceptualization  of  basic  principles  of 
pesticide  modeling.  The  resulting  models  will 
almost certainly not represent basic processes
in fundamental mechanistic terms, but will
instead represent the soil-water-pesticide
system in statistical terms.

In line with Wagenet's assertion, we propose to
simultaneously describe global and local variability
by means of mixing parameter regression.

Soil can be considered a mixture in both the
chemical and the statistical sense. However,
unlike the uniformity inherent in the molecules of
compounds, the substances such as clays, sands,
pebbles and cobbles which are described below do
not have uniform characteristics. Rather than
uniformity, there is a degree of variability in
hydraulic conductivity, density and pore size, as well
as in many other characteristics of any given
substance, which is comparable to the obvious
variability between substances.

For the stochastic and other models with which
water-borne contaminant flows are modeled, it is of
great value to parametrize the between-substance
variability separately from the within-substance
variability. In the two-substance case, consider
mixture model (1) of the conditional probability
density $f(y|x)$ of flow variate value $Y=y$ at a given
value $x$ of key variate $X$:

$$f(y|x) = P(x)\, f_1\{[y-\mu_1(x)]/\sigma_1(x)\}/\sigma_1(x)
  + \{1-P(x)\}\, f_2\{[y-\mu_2(x)]/\sigma_2(x)\}/\sigma_2(x) \qquad (1)$$

where $f_1$ and $f_2$ are probability densities which are
symmetric about zero, regression functions $\mu_1(x)$,
$\sigma_1(x)$, $\mu_2(x)$ and $\sigma_2(x)$ describe the local
substance-specific variation with value $x$ of variate $X$ (below we
will specifically refer to $x$ as a depth) and finally, and most
importantly, mixing parameter regression function $P(x)$
expresses the relationship between pure global
variation of the $Y$ variate and the value of the key
variable $X$. Local variation within contiguous and
homogeneous soil subregions, or pockets, is described
by the functions $f_1$ and $f_2$, where these functions will
be assumed to be functionally independent of $x$. In
realistic applications, there will of course be
more than two classifications of soil types and $X$ will
be vector rather than scalar-valued. However, both for
purposes of illustration, and because the methodology
illustrated below is at the cutting edge of what is now
computationally feasible, only the two component
scalar case will be discussed.
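
To make the two-component scalar model concrete, the following sketch
evaluates the conditional density of the mixture and draws simulated
values at a given depth. It is only an illustration: the standard
normal is assumed for both component densities, and the particular
regression functions (a mixing proportion that grows linearly with
depth, constant component means and standard deviations) are
hypothetical choices, not values from this paper.

```python
import math
import random

def normal_pdf(z):
    """Standard normal density, symmetric about zero."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def mixture_density(y, x, P, mu1, sigma1, mu2, sigma2,
                    f1=normal_pdf, f2=normal_pdf):
    """Conditional density f(y|x) of the two-component mixture model."""
    p = P(x)
    return (p * f1((y - mu1(x)) / sigma1(x)) / sigma1(x)
            + (1.0 - p) * f2((y - mu2(x)) / sigma2(x)) / sigma2(x))

def simulate(x, rng, P, mu1, sigma1, mu2, sigma2):
    """Draw one Y at depth x, assuming normal components."""
    if rng.random() < P(x):
        return rng.gauss(mu1(x), sigma1(x))
    return rng.gauss(mu2(x), sigma2(x))

# Hypothetical regression functions (assumptions for illustration only).
P = lambda x: min(1.0, max(0.0, x / 1000.0))   # global variation with depth
mu1 = lambda x: 10.0                           # component 1 location
sigma1 = lambda x: 1.0                         # component 1 scale
mu2 = lambda x: 2.0                            # component 2 location
sigma2 = lambda x: 0.5                         # component 2 scale
```

With these choices, changing the depth alters only the relative weight
of the two components, which is the pure global variation that the
mixing parameter regression function is meant to capture.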

Previous statistical literature which discusses mixture
model regression focuses upon the relationship
between the two means, $\mu_1(x)$ and $\mu_2(x)$.
Quandt (1958, 1972), Kiefer (1978), and Quandt and
Ramsey (1978) refer to this model as "switching
regression". The distinction between switching
regression and mixing parameter regression is central
to the theme of this paper. A switching regression
curve describes the overall distribution, and hence
one form of variation of the distribution in its entirety.
On the other hand, the mixing parameter regression
function $P(x)$ describes pure global variation. In the
case of soil constituents such as sand or cobbles it
quantifies the variation of a constituent in its entirety,
independent of variation within the constituent itself.
For example, it can be used to indicate how flow is
affected by the change from the proportion of cobbles
found at one depth $X=x_1$ to the proportion found at
a second depth $X=x_2$. If the parameters of the
cobble-specific density also change with depth, this change
will affect the overall model through parameters other
than $P(x)$, specifically $\mu_1(x)$ and $\sigma_1(x)$. (Below we will
use cobbles in examples of the new mixture
methodology to emphasize that the material whose
properties are being studied cannot always be
brought to the surface and examined directly, but
instead must often be examined in situ.)

For purposes of illustration, suppose $f_1$ describes local
variation within deposits of cobbles and $f_2$
describes local variation within deposits of sand.
(Below, we will refer to the former as high density
and the latter as low density pockets.) It is extremely
convenient to separate the estimation of the function
$P(x)$ from the estimation of the parameters which
form part of the arguments of $f_1$ and $f_2$. The stochastic
models used to simulate flow processes can be
systematically constructed when these three functions
are considered separately. In addition, in a fashion
analogous to nested analysis of variance (ANOVA; Fraser,
1958, pp. 141-150), this formulation can
facilitate studies of the relative importance of local (in
nested ANOVA, within) versus global (in nested
ANOVA, between) variation of soil characteristics for
the prediction of water and contaminant flow through
soil (Ray and Turk, 1991).

2.  Soil  Configurations 

Three basic types of soil configuration are shown in
Figures 1a, b and c. Figure 1a depicts a distribution
where small pockets of high density soil are uniformly
distributed. (In terms of mixture model (1), the
proportion of high density soil at depth x is
parametrized by P(x).) Because of this uniformity, the
soil configuration shown in Figure 1a can be analyzed
by switching regression methods.

Figure 1b shows a mixture which is similar to that
shown in Figure 1a, where large pockets of high
density soil are uniformly distributed. Figure 1c
shows a configuration where the proportion of high
density soil increases with depth variate x. Note that
within each pocket, in other words, locally, hydraulic
conductivity, infiltration, or movement might
reasonably be assumed to have the same distribution.
However, globally, the pocket-type (in other words
the type of mixture subpopulation) might vary as a
function of depth x. (Of course, it is also possible that
other parameters besides P(x) are non-constant
functions of x.)

Even though depth, as depicted in Figures 1a, b and c, is
usually treated as a vertical coordinate, below the
letter x will be used to represent a specific depth.
This choice of representation was required by the
convention that x represent the independent variate
within a regression relationship. Consequently, in
the scatter diagram and estimated density displays in
this paper, the depth variate will vary horizontally,
along the x-axis.


Figure 1a.  Mixture of two soil types: small pockets of
high density soil.

Figure 1b.  Mixture of two soil types: large pockets of
high density soil.

Figure 1c.  Mixture of two soil types: high density soil
increases with depth.

3.  Theory

The methodological approach used to estimate the
function P(x) is based on the distinction between
mixing parameter and other forms of regression. For
example, in the bivariate case what Quandt
(1958, 1972) refers to as switching parameter
regression involves the decomposition of a mixture of
two bivariate normal densities, where the constant
value P(x)=p specifies how much one density
contributes to the overall mixture. (For example, in
Section 4.3 of Quandt and Ramsey (1978), the mixing
parameter, which these authors call "λ," is not
considered to be functionally related to the value of the
random variate, which these authors represent by "e.")
The decomposition of bivariate normals is discussed
by Tarter and Silvers (1975) and Titterington, Smith
and Makov (1985), pp. 142-145. The two variate
special case for which P(x)=p, and therefore P(x) is a
horizontal line, is the only situation where mixing
parameter regression is equivalent to the
decomposition of bivariate normal densities.

In  the  two  dimensional  normal  special  case,  mixing 
parameter  regression  is  parametrized  in  terms  of  the 
conditional  normal  density  and  not  in  terms  of  the 
bivariate  normal.  Because  it  is  a  shorter  and  less 
technical  term,  below  we  will  refer  to  estimated 
conditional  densities  as  "slices."  At  a  given  point  x, 
P(x)  determines  the  proportion  of  a  slice  attributable 
to  one  component  of  a  given  two-component  mixture. 
For example, in certain applications where x is a
depth measurement, P(x) measures the proportion of
one of the following soil components that is
present within a specific soil layer: clayey soils,
sandy textured soils, or soils with large pores (cobbles,
large rocks, void root channels, worm holes, etc.)
(Helling and Gish, 1986). While the standard bivariate
normal mixture model considered by Tarter and
Silvers (1975) describes a situation where the means of
the individual subpopulations of soil components
change with depth, it does not describe a situation
where the proportions of soil components vary with
depth.

Because the conditional estimation or slicing process
is central to mixing parameter regression, the crucial
step of P(x) determination is the estimation of density
slices. Once a slice is estimated at a depth x, this slice
is then separated into its constituent components. This
is accomplished by using the univariate procedures
described by Kronmal (1964), by means of which a
density can be estimated using a kernel transform
which can reduce the overlap of mixture components.
As modified by Tarter (1979a), Section 4, and described
by Titterington, Smith and Makov (1985), pp. 138-140,
this procedure will below be called the λ method. The
following example describes these steps:

Figure 2 is a scatter diagram constructed from 1000
data points simulated as part of the study described
in the next section. The dark vertical line shown in
this figure is the intersection of the plane through the
frequency axis, upon which Figures 3-5 are drawn,
and the plane within which scatter diagram points
are depicted.


Figure 2.  Scatter diagram of a 1000 point sample from
the simulation experiment described in Section 3.

The first step in the process of mixing parameter
regression is the estimation of the joint density of the
dependent and independent variates. Methodology
described in Tarter and Lock (1991) was used for this
purpose. After the bivariate density was constructed,
equations (2.22) and (2.23) of Tarter (1979b) were used
to estimate the conditionals at a sequence of
independent variate values. The slice taken at point
x=600 inches through the line shown in Figure 2 is
shown in Figure 3.

The spurious bumps shown at the right side of Figure
3 are due to the use of a curve estimation procedure
based on fixed kernel methodology. As discussed by
Tarter and Lock (1991), Section 3, methods of Breiman,
Meisel and Purcell (1977) and Tarter and Silvers (1975),
Expression (2.13), can be used as the basis of variable
kernel techniques. However, because the
transformation procedures (which we presently use to
accomplish the same goals towards which variable
kernel approaches are directed) would have
complicated this paper, we used basic fixed kernel
procedures to conduct the study described below.

Figure 3.  [...] variate in Figure 2 given the value x=600, i.e. along
the solid line shown in Figure 2.

Figure 4.  [...] Figure 3 with standard deviations of subcomponents
reduced.

Figure 4 depicts the application of the Kronmal (1964)
and Titterington, Smith and Makov (1985) procedure
to the curve shown in Figure 3. This involved the
sample-size-controlled modification of the Fourier
transform of the estimated density to obtain a curve
estimate where the standard deviations of all mixture
components are reduced by a user selected
constant λ.
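
Why a modification of the Fourier transform can shrink the spread of
every component at once is easiest to see for normal components. The
worked equations below are a sketch under that assumption; the symbols
$\phi$, $p_j$, $\mu_j$ and $\sigma_j$ are introduced here for
illustration, and $\lambda$ is used on the variance scale, which may
differ from the parametrization of the cited procedure.

```latex
% Characteristic function of a normal mixture with weights p_j:
\phi(t) = \sum_j p_j \exp\!\left( i \mu_j t - \tfrac{1}{2} \sigma_j^2 t^2 \right)

% Multiplying by a growing Gaussian factor, with 0 \le \lambda < \min_j \sigma_j^2:
\phi_\lambda(t) = e^{\lambda t^2 / 2}\, \phi(t)
               = \sum_j p_j \exp\!\left( i \mu_j t - \tfrac{1}{2} (\sigma_j^2 - \lambda) t^2 \right)
```

The second expression is the characteristic function of a mixture with
the same weights and means, but with each component variance reduced
from $\sigma_j^2$ to $\sigma_j^2 - \lambda$. Because the weights are
unchanged, the mixing proportion can still be read from the sharpened
curve. The multiplier grows with $|t|$, so it also amplifies sampling
error in the estimated transform at high frequencies, which is one
reason the amount of modification has to be controlled.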

Once the mixture-component-specific variances are
reduced sufficiently so that the resulting curve has no
component overlap, one or another of the components
can be excised and mixing parameter P(x) can be
estimated for the slice at X=x. Finally, the
post-excision curve can have its component standard
deviation increased again, using the setting -λ (which
brings it back to the neutral setting). The effect of this
step is shown by the solid curve in Figure 5. Once a
component of the slice at x is isolated, the
component-specific mean and standard deviation,
$\mu_1(x)$ and $\sigma_1(x)$, can be estimated. The
dashed curve in Figure 5 was fit to the mixture
component using the $\mu_1(x)$ and $\sigma_1(x)$ values
estimated from the data shown in Figure 2. For the
sample being illustrated, the only parameter which
could not be accurately estimated is the standard
deviation of the smaller density component, $\sigma_2(x)$.

4.  Blind  study 

The graphical selection of the point at which
components are sufficiently drawn apart for the
separation step to be instituted is interactive.
Consequently, the authors designed a simple single
blind trial to assess the performance of the mixing
parameter regression estimation procedure. One
author devised and implemented simulation
procedures and, independently, a second author
performed the steps of the interactive parameter
estimation process without any knowledge of the
parameters selected for the simulated samples.
Twenty different samples, where each sample
corresponded to a reasonable choice of the curve P(x)
as well as comparable regression functions associated
with other parameters, were generated. (As a point of
reference, it is the special case where P(x) is a
horizontal line and where the component mean
functions $\mu_1(x)$ and $\mu_2(x)$ are linear that corresponds
to switching regression.)

Figure 5.  Leftmost component isolated from the
conditional density shown in Figure 3. The dashed
line is a normal curve fit to the component.

Two forms of simulation modeling were used in the
blind study. In the first set of simulations, the mixing
parameter regression model was used directly. Here,
models in which the mixing parameter P(x) varied as
a linear function of x were used to simulate soils in
which the proportion of high density soil increased
directly with soil depth (as illustrated by Figure 1c).
Models in which P(x) had many modes or bumps
(technically, where the derivative P'(x) had many
roots) were used to simulate soils of the sort that had
a uniform dispersion of small pockets or lenses of
high density soil. A second set of simulation
experiments in which soil configurations were
modeled stochastically by randomly assigning the
location and size of high density soil pockets was also
conducted.

Because switching regression and most mixture
decomposition methods are applicable to normal
data, normal random deviates were used to describe
variation about parameter values. The interactive
process of estimating all five regression curves P(x),
$\mu_1(x)$, $\sigma_1(x)$, $\mu_2(x)$ and $\sigma_2(x)$ takes approximately
twenty minutes using an IBM Personal System/2
Model 70 386 with a sample of one thousand points.
The consistent performance of the estimation method
in determining model parameters demonstrated the
feasibility of using the mixing parameter regression
model with the PC-based computational hardware
which is available today. The following is a
representative selection of trials, which correspond
respectively to the three soil pocket configurations
described in Section 2.

5.  Discussion  and  Trials 

Cressie (1988) described local and global components
of environmental variation using the stochastic model
$Y(x) = \mu(x) + W(x) + \epsilon(x)$, where $\mu(\cdot)$ is the
deterministic mean structure by which large scale
variation is modeled, $W(\cdot)$ is a zero mean intrinsically
stationary process used to represent small scale
variation and $\epsilon(\cdot)$ is a zero mean white noise process
independent of $W$ used to represent measurement
error.

Universal kriging and intrinsic random function of order k
methods have been used to separate large scale from
local environmental variation. Cressie (1986)
compared these methods and proposed the median
polish method for the estimation of large scale
variation or drift. These methods are based on the
removal of large scale variation in order to locally
predict the value of Y at a point (or small region) of
contiguous X values. The estimated variogram, an
estimate of the functional $E[Y(x+h) - Y(x)]^2 / 2$, is
the resulting descriptive estimator of local variability.
While the variogram targets local variability, the
estimates obtained using mixing parameter regression
yield separate local and global variability statistics.
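
For readers unfamiliar with variogram estimation, the sketch below
computes the classical method-of-moments estimate of
$E[Y(x+h) - Y(x)]^2 / 2$ for equally spaced observations. It is an
illustration only: it is the classical estimator rather than the robust
estimator of Cressie and Hawkins (1980), the function name is
hypothetical, and `max_lag` is assumed to be smaller than the number of
observations.

```python
def classical_variogram(y, max_lag):
    """Estimate gamma(h) for lags h = 1..max_lag from a series y[0], y[1], ...

    gamma(h) is half the average squared difference between
    observations that are h positions apart.
    """
    gamma = {}
    for h in range(1, max_lag + 1):
        sq_diffs = [(y[i + h] - y[i]) ** 2 for i in range(len(y) - h)]
        gamma[h] = sum(sq_diffs) / (2.0 * len(sq_diffs))
    return gamma

# A series with a pure linear trend: differences at lag h all equal h.
trend = [0.0, 1.0, 2.0, 3.0, 4.0]
gamma_trend = classical_variogram(trend, 2)
```

Because the estimate depends only on differences between nearby
observations, it summarizes local variability and carries no direct
information about which mixture component an observation came from.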

Because of the above distinction it is of interest to
study the behavior of the variogram estimator
obtained from the examples shown in Figures 1a and
1b (which illustrate the situation where there is no
systematic mean value drift) and the Figure 1c
example.

Figure 6 displays the robust variogram estimates
(Cressie and Hawkins, 1980) for the three examples
shown in Figures 1a, 1b and 1c. The mixing
parameter regression estimators for these three
examples are shown in Figure 7. It is notable that the
three examples are easily distinguished from one
another by the mixing parameter method but, to all
intents and purposes, yield indistinguishable
variogram estimators.

Figure 6.  Variogram estimates of data simulated to
have the characteristics described in Figures 1a, b and
c.

Figure 7.  Regression of the mixing parameter versus
depth for the three types of data described in Figures
1a, b and c.

Figure 8.  Comparison of estimated and true
regression relationship between the mixing parameter
and depth.

An  example  of  an  actual  mixing  parameter  regression 
curve  is  shown  in  Figure  8  as  is  the  population  curve 
which  mixing  parameter  regression  estimates.  These 
curves  correspond  to  the  soil  configuration  depicted 
in Figure 1c. Although the shape (linear) and the slope
can be estimated with great precision from a sample
of n=1000 points, Figure 8 also shows a bias which is
characteristic  of  the  computational  and  statistical 
approaches  which  are  currently  available.  Bias  in  the 


mixing parameter regression process tends to be the
result of the following two closely related causes: (1)
The process which first separates distributional
components and then, in effect, snaps them back into
their non-variance-reduced form does not eliminate
all overlap effects. (2) The curve estimation
methodology upon which the component reduction
and isolation methodology is based is in no way
tuned to perform well for mixture parameter
regression applications. In particular, as discussed by
Tarter and Lock (1991), with very few exceptions,
there is a tendency for both kernel and series
approaches to inflate the variance of estimated
densities.
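
The variance inflation caused by a fixed bandwidth kernel can be checked numerically. The following is a minimal sketch, not the estimator used in the experiments: it assumes a Gaussian kernel with bandwidth h, uses made-up data values, and the function names are hypothetical. It integrates the kernel estimate on a grid and compares its variance with the variance of the empirical distribution of the sample; the difference should be close to the kernel variance h^2.

```python
import math

def gaussian_kde(x, data, h):
    """Fixed-bandwidth Gaussian kernel density estimate evaluated at x."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

def kde_variance(data, h, steps=4000):
    """Variance of the estimated density, by midpoint-rule integration."""
    lo, hi = min(data) - 10.0 * h, max(data) + 10.0 * h
    dx = (hi - lo) / steps
    grid = [lo + (k + 0.5) * dx for k in range(steps)]
    dens = [gaussian_kde(x, data, h) for x in grid]
    mean = sum(x * f for x, f in zip(grid, dens)) * dx
    return sum((x - mean) ** 2 * f for x, f in zip(grid, dens)) * dx

def empirical_variance(data):
    """Variance of the empirical distribution of the sample (divisor n)."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)

data = [0.3, 1.1, 1.9, 2.4, 3.8, 4.0, 5.2]  # illustrative values only
h = 0.5                                      # assumed bandwidth
inflation = kde_variance(data, h) - empirical_variance(data)
```

Under these assumptions `inflation` agrees with `h ** 2` up to integration error, which is the constant inflation discussed in the text.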

Use of an everywhere non-negative fixed bandwidth
kernel must inflate this variance by a constant
approximately equal to the variance of the kernel.
(Expression 10 of Tarter and Raman (1971) indicates
that such an estimate is the convolution of the kernel
and a density whose variance is identical to n/(n-1)
times the sample variance.) Hence, we are presently
experimenting with the equivalent of variable
(Breiman, L., Meisel, W. and Purcell, E., 1977) and
somewhere-negative kernel methods which will yield
mixing parameter regression estimators which have
reduced bias properties. It is hoped that when
appropriate methods are found it will be possible to
obtain accurate mixing parameter regression curves
with sample sizes considerably smaller than the
n=1000 sized samples used in the above experiments.

1. INTRODUCTION

Software testing, or debugging, is one of the most
important components of software development. It
has been estimated that in many projects debugging
accounts for around 50% of the total development
effort. An obvious question in the debugging process
is when to stop. One naive answer is, of course, that
the process continues until there are no bugs (errors)
in the program. However, this is a very difficult goal
to achieve. For most commercial software, the release
requirement is usually not that the software be 100%
error free, but that it have an acceptable error rate.
Again, determining when a debugging process has
reached this stage is difficult. The best bet is often an
estimate of the future error rate. However, the
accuracy of the estimate may not be very high,
depending on the estimation formula and, more
seriously, on the assumptions that the formula is
based upon. Let there be m faults in the software and
let their failure rates be

$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m > 0.$ (1.1)

Some models assume an equal failure rate for
all the faults (e.g. [2, 13, 14]), and others assume
unequal failure rates (e.g. [3, 8]). Other models are
based on failure times during testing instead of the
failure rates of faults. Among them are the basic
execution time and logarithmic Poisson execution
time models (see Musa et al. [10, 11]). However,
these models are usually very difficult to verify.

In this paper, we try to find optimal stopping
rules based on the testing cost and the fault penalty
after the software is released, under very few model
assumptions. The debugging procedure is to test N
cases during each testing stage, then to debug the
software according to the testing result and decide
whether to stop testing or not. Here we do
not restrict ourselves to the case N = 1 as in the usual
sequential analysis, because for large programs
testing and debugging may not be done
simultaneously. Debugging is performed only when a
large number of programs has been tested. Optimal
stopping rules in software testing have been noted in
some recent literature. For example, Ross [14]
considered a stopping rule based on an estimate of
the future failure rate. It differs from ours in its cost
criterion. We feel that to know when to stop, one
must know the relative costs of testing and of the
penalty due to future failure. Dalal and Mallows ([6],
[7]) considered a loss function that can be equated to
costs, but their rules depend on some prior
assumption on the λ's. Rasmussen and Starr [13],
Nayak [12], and Goudie [9] have also considered loss
functions, but their loss is basically a function of the
remaining number of bugs instead of the future
failure rate. It seems that the main concern of
software reliability is the future failure rate rather
than the number of bugs. Of course, the two are
equivalent if equal failure rates are assumed for each
bug. We feel that this assumption is probably not
realistic.

2.  Theory 

In this section, only the most reasonable cases are
presented. Possible generalizations to more
complicated situations are given in §4. The
assumption on λ is (1.1), with m and all the λ values
unknown.

Let $C_1$ denote the cost of testing one case, $C_2$ be the
cost (penalty) when an error is encountered in the
released software, and M be the expected number of
cases to be run by the consumers.

To develop the theory, let I(x) be the indicator function

$I(x) = 1$ if x is true, $I(x) = 0$ if x is false,

$X_i(n)$ = number of times that the ith bug is
encountered by the end of the nth test period,

and the remaining failure rate

$U(n) = \sum_{i=1}^{m} \lambda_i\, I(X_i(n)=0).$

Then if the program is released after the nth testing
period, the cost is

$C(n) = C_2 M\,U(n) + C_1 nN.$ (2.1)

Our purpose is to find the optimal stopping rule $\psi$
such that

$E\,C(\psi) \le E\,C(\tau)$

for any stopping rule $\tau$. Here E denotes
expectation. Note that a stopping rule is a decision
that depends only on the sampling information from
the past, not the future. To put it in the usual
notation, a stopping rule $\tau$ is a random variable such
that the event $\{\tau = n\} \in \mathcal{F}_n$, where $\mathcal{F}_n$ is the sigma
field generated by all the previous samples up to n.
To find the optimal stopping rule, we first assume
that all the λ's are known and use Theorem 3.3 in
Chow, Robbins, and Siegmund [5]. In order to follow
the theorem more easily, we let the payoff function be
$g(n) = -C(n)$; the equivalent problem
now is to find $\psi$ such that

$E\,g(\psi) \ge E\,g(\tau)$ (2.2)

for any stopping rule $\tau$. Without loss of generality,
we may let $C_1 = 1$ and $C_2 M = c$. Hence,

$g(n) = -c\,U(n) - nN.$

Using our notation, we restate Theorem 3.3 of [5]. If
the set

$A_n = \{\, E(g(n+1) \mid \mathcal{F}_n) \le g(n) \,\}$

is monotonically increasing with respect to n and

$\liminf_n \int_{\{\psi > n\}} g^{+}(n)\,dP = 0,$

then (2.2) is true for all $\tau$ satisfying

$\liminf_n \int_{\{\tau > n\}} g^{-}(n)\,dP = 0,$

where $\psi$ is the first $n \ge 1$ such that
$g(n) \ge E(g(n+1) \mid \mathcal{F}_n)$.

It can be shown that the optimal stopping rule $\psi$ is
to stop at

the first $n \ge 1$ such that
$\sum_{i=1}^{m} \lambda_i\,[1 - (1-\lambda_i)^N]\, I(X_i(n)=0) \le N/c.$ (2.3)

For small λ's, a good approximation for (2.3)
becomes

the first $n \ge 1$ such that
$\sum_{i=1}^{m} \lambda_i^2\, I(X_i(n)=0) \le 1/c.$ (2.4)

Since the λ's are unknown, the $\psi$ defined in (2.3)
cannot be put into practice, but it tells us that if
there is a good estimate of the left side of the
inequality in (2.3) or (2.4), we may be close to the
optimal stopping rule. Moreover, any stopping rule
that can almost reach the optimal value $EC(\psi)$
obtained from $\psi$ when the λ's are known must be
nearly optimal. From the simulation study to be
presented in the next section, many nearly optimal
situations are identified.
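
The two stopping conditions can be sketched in code for the idealized case where the failure rates are known. This is an illustrative sketch only: the function names, the `seen` flags (True once a bug has been encountered) and the numbers are assumptions, not part of the paper.

```python
def lhs_exact(lams, seen, N):
    """Left side of (2.3): sum of lam * (1 - (1 - lam)^N) over unseen bugs."""
    return sum(l * (1.0 - (1.0 - l) ** N)
               for l, s in zip(lams, seen) if not s)

def lhs_approx(lams, seen):
    """Left side of (2.4): sum of lam^2 over unseen bugs."""
    return sum(l * l for l, s in zip(lams, seen) if not s)

def stop_exact(lams, seen, N, c):
    """Rule (2.3): stop when the exact left side is at most N / c."""
    return lhs_exact(lams, seen, N) <= N / c

def stop_approx(lams, seen, c):
    """Rule (2.4): stop when the small-lambda approximation is at most 1 / c."""
    return lhs_approx(lams, seen) <= 1.0 / c

# Hypothetical state after some test period: the largest bug has been found.
lams = [0.05, 0.01, 0.001]
seen = [True, False, False]
exact_says_stop = stop_exact(lams, seen, N=100, c=1e4)
approx_says_stop = stop_approx(lams, seen, c=1e4)
```

With these made-up numbers the two rules disagree, because N times the remaining rate 0.01 is not small; for very small rates the left side of (2.3) is close to N times the left side of (2.4).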

Let $\theta = E \sum_{i=1}^{m} \lambda_i^2\, I(X_i(n)=0)$. Then it can be
shown that

$\theta = \sum_{i=1}^{m} \lambda_i^2 (1-\lambda_i)^{nN}.$

It is known in the literature (e.g. [2], [4], [15])
that

$E \sum_{i=1}^{m} I(X_i(n)=2) = \binom{nN}{2} \sum_{i=1}^{m} \lambda_i^2 (1-\lambda_i)^{nN-2}.$

By ignoring the small difference between $(1-\lambda_i)^{nN}$
and $(1-\lambda_i)^{nN-2}$, we can estimate $\theta$ by

$\hat\theta = \frac{2}{nN(nN-1)} \sum_{i=1}^{m} I(X_i(n)=2) = \frac{2}{nN(nN-1)}\, B_n,$

where $B_n$ is the number of doubletons, i.e., the
number of bugs that have been encountered exactly
twice up to stage n. Thus, a reasonable adaptive rule
$\hat\psi$ for the recapture debugging procedure is to stop at

the first n such that $\frac{2}{nN(nN-1)}\, B_n \le 1/c.$ (2.5)
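
A minimal sketch of the doubleton-based rule follows. It assumes the estimate has the form 2 B_n / (nN(nN-1)) as in the derivation above; the function names and the encounter counts are hypothetical.

```python
def theta_hat(counts, n, N):
    """Doubleton estimate of theta after n stages of N test cases each.

    counts[i] is the number of times bug i has been encountered so far.
    """
    total = n * N
    doubletons = sum(1 for x in counts if x == 2)   # B_n
    return 2.0 * doubletons / (total * (total - 1))

def stop_recapture(counts, n, N, c):
    """Adaptive rule: stop as soon as the estimate is at most 1 / c."""
    return theta_hat(counts, n, N) <= 1.0 / c

# Hypothetical encounter counts after n = 3 stages of N = 100 cases.
counts = [5, 2, 2, 1, 0, 2]
estimate = theta_hat(counts, n=3, N=100)
```

Here three bugs are doubletons, so the estimate is 6 / (300 * 299); the rule stops for c = 10^4 but not for c = 10^5.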


The expected cost of this rule is denoted by $EC(\hat\psi)$.
When the standard debugging procedure is used, the
number of doubletons up to stage n has to be
estimated, because all the previous bugs before stage
n have been removed. Let $s_n$ and $b_n$ denote the
number of singletons and doubletons discovered at
test period n. They are observable.

Let $S_n$ denote the number of singletons encountered
up to stage n. $S_n$ and $B_n$ can be estimated recursively
by


[Equations (2.6a) and (2.6b), the recursions for $S_n$ and $B_n$, are illegible in the source.]

The two formulae are derived from the maximum
likelihood principle. Thus, for the standard
debugging procedure, the stopping rule $\tilde\psi$ is to stop
at

the first n such that $\frac{2}{nN(nN-1)}\, B_n \le 1/c.$ (2.7)

Again, we denote the cost under this rule by $EC(\tilde\psi)$.
Analytic study of the performance of (2.5) and (2.7)
seems to be very difficult. Simulations are used to
evaluate their performance.

3. SIMULATION STUDY

The stopping rules: optimal (2.3),
approximation (2.4), adapted to the recapture debugging
procedure (2.5), and adapted to the standard
debugging procedure (2.7) are compared. In standard
software development, the failure rate should not be
very high at the testing stage. Three values, 0.10,
0.05, and 0.01, are assigned for the total failure rate
$T_m = \sum_{i=1}^{m} \lambda_i$, with m=100. Four configurations for λ
are used:

a) rapidly decreasing λ (exponential rate): $\lambda_i = K/2^i$,
i=1,2,...,m;

b) moderately decreasing λ (Zipf's Law): $\lambda_i = K/i$,
i=1,2,...,m;

c) slowly decreasing λ (constant): $\lambda_i = K$, for all i=1,
...,m;

d) random λ (following [14]): $\lambda_i = K U_i$, where $U_i$ is a
random number, i=1,...,m,

where K is the normalization constant so that
$\sum_{i=1}^{m} \lambda_i = T_m$.

Since $T_m$ is small, small N will make the test very
ineffective. We choose N=100, 300, 500. Since
N=100 is sometimes too small for the case $T_m$=0.01,
we hold our decision if there are no doubletons before
the 5th period. Similarly, we hold our decision for
N=300 and 500 if there are no doubletons in the first
period. The value $c = C_2 M / C_1$ can vary considerably
across real situations. We feel that $C_1 = 1$,
$C_2 = 100$, and M=10^ is a reasonable middle ground.
Thus, three values, c=10^, 10®, and 10^ are used.
One hundred simulations were done for each
combination of c, N, and $T_m$.

The following results have been observed.

1) The first thing that surprises us is that there
is little difference between (2.5) and (2.7). After
a detailed check of the stopping process, we found
that this was due to the large variation in $B_n$, the
number of doubletons. Singletons are more stable in
the sample. Thus, the $B_n$ estimated from singletons
can be as effective as the doubletons.

2) It is actually unfair to compare (2.5) and
(2.7) with (2.3), because in (2.3) all the λ's have to
be known. There is a tremendous difference in prior
information between (2.3) and the two adaptive
stopping rules. However, the simulations show that
in most situations, especially for cases b, c, and d,
the adaptive methods perform extremely well. It is
unlikely that in these situations any other stopping
rule can beat them without any prior information on
λ. At least we can say that they are nearly optimal.

3) From the expected cost point of view, the
initial total failure rate $T_m$ has less influence than
the sizes of the λ's.

4) Case a shows the biggest discrepancy in
costs between (2.5), (2.7), and (2.3). The reason is
easy to see. If we have $\sum \lambda_i^2 \le 1/c$ in the beginning,
then no testing is the best strategy when the λ's are
known. But in the real situation, where the λ's are
unknown, it will take a considerable number of
testing samples to discover this fact. Thus, the
adaptive methods cost considerably more than the
optimal rule (2.3). Another situation, such as the
rapidly decreasing λ case, is that although
$\sum \lambda_i^2 \le 1/c$ is not true for all the λ's in the beginning, the
λ's are dominated by a few large ones and once they
are removed, $\sum \lambda_i^2 \le 1/c$ is satisfied by the rest of
the λ's. Since the large λ's can be discovered pretty
easily, they can be removed in the very beginning.
The debugging process can then be stopped because
the λ's are known. The adaptive methods again have
to identify this fact at considerable cost.

5) The final costs vary little with the test size
N.
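
The four failure-rate configurations of the simulation study can be generated along the following lines. This is a minimal sketch under stated assumptions: the function name is hypothetical, configuration a is taken to be K/2^i, and the random numbers of configuration d are taken to be uniform on [0, 1), which the text does not specify.

```python
import random

def failure_rates(config, m=100, total=0.05, rng=None):
    """Return m failure rates for configuration 'a'..'d', summing to `total`."""
    if config == "a":            # rapidly decreasing (exponential rate)
        raw = [0.5 ** i for i in range(1, m + 1)]
    elif config == "b":          # moderately decreasing (Zipf's law)
        raw = [1.0 / i for i in range(1, m + 1)]
    elif config == "c":          # constant
        raw = [1.0] * m
    elif config == "d":          # random (uniform draws are an assumption)
        rng = rng or random.Random(0)
        raw = [rng.random() for _ in range(m)]
    else:
        raise ValueError("config must be one of 'a', 'b', 'c', 'd'")
    K = total / sum(raw)         # normalization constant
    return [K * r for r in raw]

# Sum of squared rates, the quantity compared with 1/c in rule (2.4).
sum_sq = {cfg: sum(l * l for l in failure_rates(cfg)) for cfg in "abcd"}
```

For a fixed total failure rate, configuration a concentrates the rate in a few large values and so has the largest sum of squares, which is consistent with the special behaviour of case a noted in result 4.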

4.  CONCLUDING  REMARKS 

1) When the λ's are equal, (2.3) is equivalent to
(2.1) in Rasmussen and Starr [13]. We also repeated
the simulation for the cases they considered. Our
results confirm their results.

2)  In  the  present  study,  the  testing  size  N  is  assumed 
to  be  the  same  in  all  the  testing  stages.  An 
interesting  question  would  be  what  happens  if  we 
vary  N.  One  thing  we  noticed  is  that  the  proof  of  the 
optimality  of  (2.3)  is  no  longer  valid.  It  seems  to  be 
a  significant  contribution  if  the  tester  can  choose  the 
optimal  sample  size  at  each  stage. 

3) From the derivation of (2.3), we can extend the
result to a more general cost function, i.e., let the
cost of doing x tests be f(x). If $\Delta f(x) = f(x+1) - f(x)$
is a nondecreasing function, then Theorem
3.3 of [5] holds and the optimal stopping rule
becomes to stop at

the first $n \ge 1$ such that
$\sum_{i=1}^{m} \lambda_i\,[1-(1-\lambda_i)^N]\, I(X_i(n)=0) \le [f((n+1)N) - f(nN)]/c.$

The assumption of $\Delta f(x)$ being nondecreasing is
reasonable when delay in releasing the software is
considered as a cost.

Abstract: In software reliability theory many
different models have been proposed and investigated.
Most of these models assume perfect repair
and constant software size. Both restrictions
oversimplify reality in a huge way. In the model we will
discuss in this paper, we have tried to overcome both
simplifications in such a way that statistical inference
is still possible.

1.  Introduction 

The reliability of hardware and software is
sometimes of vital importance to their users. During
the recent Gulf war a Patriot missile, which
was stationed in Turkey, was fired by accident
because of a bug in the software. Obviously, in
less delicate computer applications too, customers
want a high degree of reliability to be
guaranteed. The modelling of the evolution of the
reliability of a piece of software undergoing
debugging will be the subject of this paper.

In the next section we will give some background
and the classical assumptions of software
reliability theory. In the third section we describe
the PGIR model, a new model with interesting
features, and we will suggest how to estimate the
model parameters. In section 4 we will shed some
light upon the large number of extensions that
are possible, starting from this model; we will
define a class of regression models. Finally, in
the fifth and last section we give some concluding
remarks. This short paper just tries to give some
ideas and results. More details, derivations and
proofs can be found in Van Pul (1991).

2.  Backgrounds  and  classical  assumptions 

Let us consider the following test experiment.
A very large computer program is executed
during a fixed exposure period, say [0,τ]. Inputs
are selected "at random" from the input space,
that is, they are generated in such a way that they
are representative of the operational profile. For
each input the program either produces the
correct output or a software failure is detected:
the software produces a wrong answer or no
answer at all. After the detection of a failure the
CPU-clock is stopped and the software is sent to
a team of debuggers. The failure time and possibly
other failure data are observed. After the bug
is found and fixed, the CPU-clock is restarted
and testing continues with a new input until
time τ is reached.

Efforts  in  describing  the  evolution  of  the 
reliability  of  computer  software  during  test  and 
development  resulted  in  the  proposal  of  dozens 
of  new  models  over  the  past  twenty  years.  An 
important  class  of  such  models  is  the  so-called 
class  of  Error-Counting  and  Debugging  (EC&D) 
models.  This  class  consists  of  models  that  are 
based  on  the  test  experiment  described  above 
(with  only  the  failure  times  as  test  data)  and  some 
strong  assumptions: 

(A1) Perfect repair: with probability 1, no new faults
are introduced during a repair.

(A2)  Fixed  software  size:  there  is  no  addition  of 
new  software  during  testing. 

(A3)  Independence  of  faults:  faults  (and  hence 
their  failure  times)  are  independent. 

Although all three assumptions seem to be rather
unrealistic, they form a framework on which
many models are built. The most elementary and
oldest software reliability model is the model of
Jelinski-Moranda (1972), introduced almost
twenty years ago. In this model the failure rate of
the program is assumed to be at any time proportional
to the number of remaining faults, and the
repair of each fault makes the same contribution
to the decrease in failure rate. Writing n(t)
for the observed counting process, we find for the
failure intensity function λ(t) the following
expression:

$\lambda(t) = \phi\,[N - n(t-)]$ (1)

with model parameters N, the number of faults
initially present in the software, and φ, the
occurrence rate per fault, which can also be interpreted
as the test efficiency. Musa (1975), Littlewood
(1980) and many others have built more
sophisticated models which, for technical reasons,
are generally still restricted by assumptions (A1)-(A3).
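
To make the Jelinski-Moranda intensity concrete, the sketch below simulates failure times under that model. It relies on the standard consequence of a piecewise constant intensity: with i faults already removed, the waiting time to the next failure is exponential with rate φ(N - i). The function names, the seed and the parameter values are illustrative assumptions.

```python
import random

def simulate_jm(N, phi, rng):
    """Simulate the N failure times of a Jelinski-Moranda process."""
    t, times = 0.0, []
    for removed in range(N):
        rate = phi * (N - removed)      # intensity with `removed` faults fixed
        t += rng.expovariate(rate)      # exponential waiting time
        times.append(t)
    return times

def jm_intensity(t, times, N, phi):
    """Failure intensity phi * [N - n(t-)] given the observed failure times."""
    n_before = sum(1 for s in times if s < t)   # n(t-)
    return phi * (N - n_before)

times = simulate_jm(N=20, phi=0.1, rng=random.Random(1))
```

The intensity starts at φN and drops by φ at every failure, reaching zero once all N faults have been removed.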


As there exist no perfect testers and programmers,
there will always be a positive chance
of introducing new faults while repairing an old
one. Secondly, development and testing of
software usually take place simultaneously in
practice. Because the addition of software that
has never been tested before will certainly have
an effect on the reliability, it seems reasonable to
take software growth during testing into
account as well. Furthermore, certain bugs will prevent
parts of the software from being inspected and
will therefore hide other bugs, thus violating the
assumption of independence of faults. Dropping
(A3), however, would cause the mathematical
problem to become highly complicated and
almost intractable.


In the next section we introduce a new
model, the Poisson Growth and Imperfect Repair
(PGIR) model. We combine the modelling of
imperfect repair and software growth in a natural
way. Furthermore, to a certain extent the model
will account for dependencies between faults.
The model also has attractive statistical properties.


3. The PGIR model.

Let τ>0. We consider a test experiment as
described in the introduction. Let $T_0 := 0$ and
$T_i$, i = 1,2,... the failure times of the occurring
failures. Repair takes place immediately after a
failure is detected. For reasons of convenience the
addition of new software takes place only at the
failure times $T_i$. Due to the correction of a fault
and possibly due to the addition of new
software at time $T_i$, there is a change in the
software of size $K_i$, i = 0,1,.... The $K_i$ are hence the
known outcomes of some deterministic software
measure, e.g. lines of code, complexity, number of
loops or subroutine calls. At time $T_i$, apart from
deleting one fault, $N_i$ new faults are introduced,
partly due to bad repair and partly due to the
addition of new software. It seems reasonable that
$N_i$ is in some sense proportional to the "size" of
the total change in the software at time $T_i$. We
therefore assume that $N_i$ is a stochastic variable,
Poisson distributed with mean $\mu K_i$, i = 0,1,...,
where μ is a parameter. We consider the testing
process during [0,τ], observing say n(τ) faults. Let

$n(t) := \sum_{i=1}^{\infty} I(T_i \le t), \quad t \in [0,\tau],$ (2)

the number of failures detected (faults deleted)
during [0,t] and let

$N(t) := \sum_{i=0}^{\infty} N_i\, I(T_i \le t), \quad t \in [0,\tau],$ (3)

the number of faults introduced during [0,t],
where

$N_i \sim \mathrm{POI}(\mu K_i).$ (4)

We assume that the failure intensity λ, as in the
Jelinski-Moranda model, at any time is proportional
to the remaining number of faults, that is:

$\lambda(t) := \phi\,[N(t-) - n(t-)], \quad t \in [0,\tau],$ (5)

where φ denotes the constant occurrence rate per
fault. With use of the data $(T_i, K_i)$, i = 0,1,...,n(τ),
obtained from the experiment as described above,
one can estimate the parameters of the
underlying PGIR model. We will use the maximum
likelihood estimation (MLE) procedure for
this purpose. The following lemma will be very
useful:

Lemma 1:

For all $m \in \mathbb{N}$ and all $(a_0, a_1, \ldots, a_m) \in \mathbb{R}_+^{m+1}$ we
have:

$\sum_{N_0=0}^{\infty} \frac{a_0^{N_0}}{N_0!}\, N_0 \sum_{N_1=0}^{\infty} \frac{a_1^{N_1}}{N_1!}\,(N_0+N_1-1) \cdots \sum_{N_m=0}^{\infty} \frac{a_m^{N_m}}{N_m!}\,(N_0+N_1+\cdots+N_m-m)$

$= \exp\Bigl(\sum_{j=0}^{m} a_j\Bigr)\; a_0\,(a_0+a_1) \cdots (a_0+a_1+\cdots+a_m).$ (6)

Proof:

The result follows immediately by induction.
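
Lemma 1 can be checked numerically for small m. The sketch below assumes the identity has the form: the nested sum of the products of a_i^{N_i}/N_i! and (N_0+...+N_i - i) equals exp(a_0+...+a_m) times a_0 (a_0+a_1) ... (a_0+...+a_m). The infinite sums are truncated at a cutoff, and the function names and test values are hypothetical.

```python
import math
from itertools import product

def lemma_lhs(a, cutoff=30):
    """Truncated nested sum on the left side of the lemma."""
    total = 0.0
    for ns in product(range(cutoff), repeat=len(a)):
        term = 1.0
        partial = 0
        for i, (a_i, n_i) in enumerate(zip(a, ns)):
            partial += n_i                       # N_0 + ... + N_i
            term *= (partial - i) * a_i ** n_i / math.factorial(n_i)
        total += term
    return total

def lemma_rhs(a):
    """exp(sum a_j) times the product of the partial sums of the a_j."""
    result = math.exp(sum(a))
    running = 0.0
    for a_i in a:
        running += a_i
        result *= running
    return result

a = (0.7, 1.2, 0.4)          # arbitrary positive test values, m = 2
relative_gap = abs(lemma_lhs(a) / lemma_rhs(a) - 1.0)
```

For moderate values of the a_i the truncation error is negligible, so the two sides agree to within floating-point rounding.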


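We now return to the derivation of the
likelihood function for the PGIR model, as
described by (2)-(5). Aalen (1978) showed that
the likelihood function for estimating the parameters
of the intensity function of a counting process,
observed on a fixed time interval [0,τ], is
given by:

$L(N) := L(N,\phi;\,T_0,T_1,\ldots,T_{n(\tau)}) = \prod_{i=1}^{n(\tau)} \lambda(T_i-)\, \exp\Bigl(-\int_0^{\tau} \lambda(s)\,ds\Bigr).$ (7)

As the $N_i$ are independent Poisson distributed
stochastic variables with mean $\mu K_i$ we have

$p(N) := P[N_i = N_i,\ i = 0,\ldots,n(\tau)] = \prod_{i=0}^{n(\tau)} \Bigl[ e^{-\mu K_i}\, \frac{(\mu K_i)^{N_i}}{N_i!} \Bigr]$ (8)

and defining:

$a_i := \mu K_i\, e^{-\phi(\tau-T_i)}, \quad i = 0,\ldots,n(\tau),$ (9)

$b_i := I\{N_0 + \cdots + N_{i-1} - i = 0\}, \quad i = 1,\ldots,n(\tau)-1,$ (10)

we obtain the likelihood function
under the finer filtration (observing also the sizes
of the software changes) by summing Aalen's
expression (7) over all possible realisations of the
$N_i$ multiplied by their joint probabilities (8):

$L(\mu,\phi;\,(T_i,K_i),\ i=0,1,\ldots,n(\tau)) = \sum_{N_0=1}^{\infty} \sum_{N_1=b_1}^{\infty} \cdots \sum_{N_{n(\tau)-1}=b_{n(\tau)-1}}^{\infty} \sum_{N_{n(\tau)}=0}^{\infty} p(N)\, L(N)$

$= \phi^{n(\tau)} \exp\Bigl[\phi \sum_{i=1}^{n(\tau)} (\tau-T_i) - \mu \sum_{i=0}^{n(\tau)} K_i\Bigr] \times \sum_{N_0=1}^{\infty} N_0\, \frac{a_0^{N_0}}{N_0!} \sum_{N_1=b_1}^{\infty} (N_0+N_1-1)\, \frac{a_1^{N_1}}{N_1!} \cdots \sum_{N_{n(\tau)-1}=b_{n(\tau)-1}}^{\infty} (N_0+\cdots+N_{n(\tau)-1}-n(\tau)+1)\, \frac{a_{n(\tau)-1}^{N_{n(\tau)-1}}}{N_{n(\tau)-1}!} \sum_{N_{n(\tau)}=0}^{\infty} \frac{a_{n(\tau)}^{N_{n(\tau)}}}{N_{n(\tau)}!}.$ (11)

We note that if $N_0 + \cdots + N_{i-1} = i$, that is, if
$b_i = 1$ (so we have to sum $N_i$ from 1 to ∞), then
the coefficient $(N_0 + \cdots + N_i - i)$ in the i-th sum
equals zero for $N_i = 0$. So we can take all lower
bounds equal to zero and use lemma 1 to get:

$L(\mu,\phi;\,(T_i,K_i),\ i=0,1,\ldots,n(\tau)) = \phi^{n(\tau)} \exp\Bigl[\phi \sum_{i=1}^{n(\tau)} (\tau-T_i) - \mu \sum_{i=0}^{n(\tau)} K_i\Bigr] \times \exp\Bigl(\sum_{i=0}^{n(\tau)} a_i\Bigr) \prod_{i=1}^{n(\tau)} \sum_{j=0}^{i-1} a_j$

$= \prod_{i=1}^{n(\tau)} \Bigl[\mu\phi \sum_{j=0}^{i-1} K_j\, e^{-\phi(\tau-T_j)}\Bigr] \times \exp\Bigl[\phi \sum_{i=1}^{n(\tau)} (\tau-T_i) - \mu \sum_{i=0}^{n(\tau)} K_i \bigl(1 - e^{-\phi(\tau-T_i)}\bigr)\Bigr].$ (12)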

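We now take the logarithm of the likelihood function
(12), set the partial derivatives equal to zero
and solve the system of two ML-equations,
finding expressions for the ML estimators $\hat\mu$ and $\hat\phi$:

$\hat\mu := \frac{n(\tau)}{\sum_{i=0}^{n(\tau)} K_i \bigl[1 - e^{-\hat\phi(\tau-T_i)}\bigr]}$ (13)

and $\hat\phi$ is the solution of $g(\phi) = 0$ with

$g(\phi) := \frac{1}{n(\tau)} \sum_{i=1}^{n(\tau)} (\tau-T_i) + \frac{1}{\phi} - \frac{\sum_{i=0}^{n(\tau)} K_i (\tau-T_i)\, e^{-\phi(\tau-T_i)}}{\sum_{i=0}^{n(\tau)} K_i \bigl[1 - e^{-\phi(\tau-T_i)}\bigr]} - \frac{1}{n(\tau)} \sum_{i=1}^{n(\tau)} \frac{\sum_{j=0}^{i-1} K_j (\tau-T_j)\, e^{-\phi(\tau-T_j)}}{\sum_{j=0}^{i-1} K_j\, e^{-\phi(\tau-T_j)}}.$ (14)

It can be shown (see Van Pul (1990)) that the
ML-estimators are consistent, asymptotically normally
distributed and efficient.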
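
The estimation step can be sketched in code. The sketch below assumes the log-likelihood has the form n(τ) log(μφ) + Σ_i log Σ_{j<i} K_j e^{-φ(τ-T_j)} + φ Σ_i (τ-T_i) - μ Σ_i K_i (1 - e^{-φ(τ-T_i)}), and that for a given φ the maximizing μ is n(τ) divided by Σ_i K_i (1 - e^{-φ(τ-T_i)}). The function names and the data are hypothetical.

```python
import math

def pgir_loglik(mu, phi, T, K, tau):
    """Assumed PGIR log-likelihood; T[0] = 0, T[1:] are failure times."""
    n = len(T) - 1
    decay = [math.exp(-phi * (tau - t)) for t in T]
    ll = n * math.log(mu * phi)
    for i in range(1, n + 1):
        ll += math.log(sum(K[j] * decay[j] for j in range(i)))
    ll += phi * sum(tau - t for t in T[1:])
    ll -= mu * sum(k * (1.0 - d) for k, d in zip(K, decay))
    return ll

def mu_hat(phi, T, K, tau):
    """Value of mu maximizing the log-likelihood for a fixed phi."""
    n = len(T) - 1
    return n / sum(k * (1.0 - math.exp(-phi * (tau - t)))
                   for k, t in zip(K, T))

# Made-up data: four failures observed on [0, 10], with change sizes K_i.
T = [0.0, 1.2, 2.5, 4.1, 7.0]
K = [50.0, 3.0, 1.0, 4.0, 2.0]
tau, phi = 10.0, 0.3
best_mu = mu_hat(phi, T, K, tau)
```

In practice φ would be found first, for example by a one-dimensional root search, and then substituted into `mu_hat`; the sketch only illustrates that, for fixed φ, the closed-form μ maximizes the assumed log-likelihood.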

Let us consider the PGIR model again as
given by (2)-(5). Note that the process N(t) is
unobservable. Thus defining the filtrations

$\mathcal{G}_t := \sigma\{\, n(s) : 0 \le s \le t \,\}$ (15)

$\mathcal{F}_t := \sigma\{\, n(s), N(s) : 0 \le s \le t \,\}$ (16)

we notice that the intensity λ given in (5) is actually
$\lambda^{\mathcal{F}}$, the intensity function of the counting process
with respect to the filtration $\mathcal{F}_t$. With use of
the Innovation Theorem (see e.g. Bremaud
(1977)), and another application of lemma 1 we
can show that the intensity function under the
filtration $\mathcal{G}_t$ (only observing the counting process
n(s), 0 ≤ s < t, and the software changes
$K_i$, i = 0,...,n(t-)) is given by

$\lambda^{\mathcal{G}}(t) := \mu\phi \sum_{i=0}^{n(t-)} K_i\, e^{-\phi(t-T_i)}.$ (17)


An interesting idea seems to be to set all the $K_i$ equal
to some K except for $K_0 \gg K$. With parameters
$N_0 := \mu K_0$ and $N := \mu K$ the failure intensity
becomes

$\lambda(t;\,\phi,N_0,N) = N_0\,\phi\, e^{-\phi t} + N\phi \sum_{i=1}^{n(t-)} e^{-\phi(t-T_i)}.$ (18)

In this three parameter model, N, the average
number of faults introduced per repair action, can
be interpreted to account for dependencies
between faults. Whenever hidden faults become
observable because of a fault repair, this can be
considered as the introduction of new faults.
Finally note that for N = 0 the above model
reduces to the well-known model of Goel-Okumoto (1979).
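
The three parameter intensity can be written as a small function. This sketch assumes the form N_0 φ e^{-φt} + N φ Σ e^{-φ(t-T_i)}, with the sum taken over failure times before t; the function name, argument names and numbers are hypothetical.

```python
import math

def pgir_intensity(t, phi, n0, n_rep, failure_times):
    """Assumed three-parameter PGIR failure intensity at time t.

    n0 is the expected number of initial faults, n_rep the expected
    number of faults introduced per repair action.
    """
    past = [s for s in failure_times if s < t]
    initial = n0 * phi * math.exp(-phi * t)
    introduced = n_rep * phi * sum(math.exp(-phi * (t - s)) for s in past)
    return initial + introduced

# With n_rep = 0 only the exponentially decaying initial term remains.
base = pgir_intensity(2.0, 0.5, 40.0, 0.0, [0.5, 1.5])
with_repairs = pgir_intensity(2.0, 0.5, 40.0, 1.5, [0.5, 1.5])
```

Setting `n_rep` to zero leaves the term N_0 φ e^{-φt}, which is the reduction to the Goel-Okumoto form mentioned in the text.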


4.  Regression  models 

The PGIR model can be seen as a special
case within a general class of regression models.
In the previous section I assumed that the $N_i$
were Poisson distributed with a parameter
depending on a single software measure. Because
the process of introducing new faults is so
difficult to understand, it seems appropriate to
use explanatory variables and apply regression
analysis. We therefore suggest the following class
of models given by (2), (3), (5) and

$N_i \sim \mathrm{POI}(\lambda_i), \qquad \lambda_i := \exp[\beta_1 z_i^1 + \cdots + \beta_m z_i^m],$

where the $z_i^j$, j = 1,...,m, are the known realisations
of m software measures $Z^j$ (like e.g. size, complexity,
number-of-loops) at time $T_i$ and where
the $\beta_j$, j = 1,...,m, denote the corresponding regression
coefficients we have to estimate. Statistical
methods are available to investigate whether certain
explanatory variables are redundant (or not)
and whether their influence is linear, via another
power, or say logarithmic.
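
The log-linear mean of the regression class can be illustrated as follows. The function name and the numbers are hypothetical; the choice of covariates (a constant and the logarithm of the change size) is one assumed way of recovering the PGIR mean μK as a special case.

```python
import math

def fault_mean(beta, z):
    """Poisson mean exp(beta_1 z^1 + ... + beta_m z^m) for one software change."""
    return math.exp(sum(b * x for b, x in zip(beta, z)))

# Special case: covariates (1, log K) with coefficients (log mu, 1)
# give the mean mu * K of the PGIR model.
mu, K = 0.02, 150.0
pgir_mean = fault_mean([math.log(mu), 1.0], [1.0, math.log(K)])
```

Further measures, such as a complexity score, would enter as additional entries of `z` with their own coefficients.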


5.  Concluding  remarks 

We have constructed a model which is able
to deal with imperfect repair and software
growth. Moreover, the ML-estimators for the
model parameters have desirable asymptotic properties.

In the field of regression models for
software reliability, there is in my opinion a lot of
interesting research still to be done. Essential will
be, however, the collection of real data (computation
of various software measures) by software
developers. So far, we have not had much response
from them. Perhaps they should read Rook's
(1990) Handbook on Software Reliability. In its
preface Boehm resignedly states: "Sometime soon,
software reliability is going to become a highly visible
and important field. Unfortunately, given
human nature, its thrust into prominence will only
happen once we experience the software equivalent
of the Chernobyl, Bhopal, or space shuttle Challenger
disasters. Such a disaster is likely to happen
in the next few years...".

ABSTRACT

There  is  much  from  statistical  methodology  that  can  be 
brought  to  the  testing  of  software  systems.  Both  a  general 
paradigm  for  testing  and  an  approach  to  assurance  of 
reliability  will  be  derived  from  statistical  methods.  The 
testing  paradigm  provides  both  a  general  approach  to  the 
specification,  design  and  analysis  of  tests  and  potential  for 
reduced  complexity  of  test  equipment  (thus  reduced  cost  of 
testing).  The  testing  for  reliable  software  approach  gives  us 
a  model  upon  which  to  start  building  theory  for  the 
automated  generation  of  tests  which  go  beyond  requirements, 
capturing  the  intelligent  behavior  of  the  experienced  test 
engineer.  Relationships  between  this  framework  for  testing 
and  work  on  statistical  advisory  systems  for  the  design  of 
experiments  and  semantic  understanding  of  text  will  be 
identified. 

Introduction 

The  goal  of  this  paper  is  to  provide  connections  between 
statistics  and  Uie  testing  of  software  based  systems  which  are 
not  as  obvious  as  software  reliability  growth  modeling. 
These  connections  have  been  derived  from  the  past  ten  years 
of  testing,  examining  the  process  of  testing  and  managing 
the  testing  of  embedded  software  in  the  avionics  industry. 
We  will  first  examine  an  operating  paradigm  for  the 
development  of  effective  tests  and  then  look  at  how  this 
paradigm  may  lead  toward  automation  of  the  more  creative 
aspects  of  die  test  development  process. 

This  work  has  been  evolving  in  the  context  of  the 
development  and  test  of  systems  which  have  significant  user 
interfaces.  A  major  characteristic  of  these  systems  is  that  the 
user  cannot  be  extricated  from  the  system  itself.  Proper 
operation  depends  on  the  appropriate  interaction  between  the 
system and the user. Any breakdown of this interaction can
be  the  trigger  of  a  failure.  The  resulting  system  is 
necessarily  stochastic. 

In  systems  without  significant  user  interface,  there  is  at  least 
the  potential  to  produce  a  sufficiently  rigorous  specification 
and  implementation  to  remove  the  stochastic  problems 
created  by  the  user.  Such  systems  have  the  potential  for 
formal  proof  of  correctness  which  eliminates  the  need  for  an 
experimental  approach  to  verification  and  validation. 


Tests are Experiments

Testing software based systems (or any testing) is an
experimental process aimed at determining the state of the
system under test with respect to some standard (possibly
not deterministic). Because tests are experiments, paradigms
for the statistical design of experiments (DOE) can be applied
to improve our understanding and further advance the state of
the art of software testing.


Figure 1

A Stochastic Environment

Software tests are generally considered to be deterministic,
with clear pass/fail criteria. This is often not the case in
embedded systems and/or systems with significant user
interfaces. A normal embedded software test environment
looks something like figure 1. Sources of random variation
are the user (or test engineer) and often the test equipment
which is attempting to simulate the operational environment
of the system.

As mentioned in the introduction, the user will introduce a
significant stochastic element into the test environment. It
is not desirable to remove this stochastic element since
doing so makes our test input distribution even less like the
operational distribution the system must operate under.
Testing is then less likely to catch problems which are
important to the user.

If  we  are  dealing  with  a  system  which  is  using  hardware  at 
or  near  the  limits  of  current  technology,  the  test  equipment 
may  not  be  capable  of  providing  adequate  control  to  assure 
deterministic  operation.  The  result  is  again  random 
variation.  Even  if  we  are  not  operating  at  the  leading  edge  of 
technology,  test  equipment  which  does  not  utilize  expensive 
state  of  the  art  components  to  provide  a  fully  deterministic 
environment  is  sometimes  preferable. 

Relation  to  Statistical  Methods 

Test  objectives  are  equivalent  to  models  and  hypotheses  in 
experimentation.  The  model  defines  the  portion  of  the 
system  of  interest  in  the  objective.  Hypotheses  then  focus 
the  test  upon  particular  components  of  the  model. 
Experimental  design  principles  can  then  be  applied  to  the 
design  of  the  test  which  provides  the  requirements  for  the 
test  environment.  These  requirements  will  include  the  degree 
of  accuracy  necessary  to  provide  sufficient  power  to  the 
hypothesis  tests  which  make  up  the  pass/fail  criteria  for  the 
testing. 
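
As a rough illustration of how a pass/fail criterion can be treated as a hypothesis test with a stated power, the sketch below uses an exact one-sided binomial test on the number of failed trials. The scenario is an assumption made for illustration and does not come from the paper: n independent test runs, an acceptable failure rate p0, an unacceptable rate p1, and a significance level alpha. All function names are hypothetical.

```python
from math import comb


def binom_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n, p0, alpha):
    """Smallest failure count c such that P(X >= c | p0) <= alpha.

    The system 'fails' the test when c or more of the n runs fail.
    c = n + 1 always qualifies because the tail is then empty.
    """
    for c in range(n + 2):
        if binom_tail(n, c, p0) <= alpha:
            return c


def power(n, p0, p1, alpha=0.05):
    """Probability of rejecting a system whose true failure rate is p1."""
    c = critical_count(n, p0, alpha)
    return binom_tail(n, c, p1)


if __name__ == "__main__":
    for n in (50, 200):
        print(n, critical_count(n, 0.01, 0.05), round(power(n, 0.01, 0.10), 4))
```

Under these assumptions, a larger number of runs (or a less noisy test environment) raises the power, which is one way the required degree of control in the test environment could be derived from the test objective.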

Requirements  Based  Testing 

Requirements testing is based upon hypotheses which are
selected from the system requirements. Models selected are
subsystems  'carved'  from  the  system  architecture  and  chosen 
for  compactness  (minimal  external  connections)  to  reduce 
test  environment  requirements  while  encompassing  a 
minimum  of  extraneous  system  components  not  directly 
related  to  the  hypothesis  of  interest.  From  this  point,  the 
test  task  is  just  the  design  of  the  necessary  environment  to 
provide  the  control  and  data  collection  necessary  to  carry 
through  the  experiment. 

This  whole  process  is  almost  identical  to  the  process  of 
selection  of  appropriate  experimental  factors  which  provide  a 
sufficiently  powerful  test  of  hypothesis  about  the  safety  or 
efficacy  of  a  drug  in  pharmaceutical  testing.  The  wealth  of 
statistical  methods  applied  in  the  certification  of  drugs  in  the 
pharmaceutical industry is then clearly applicable to the
qualification  (or  certification)  testing  portion  of  software 
systems  (or  any  other  system  for  that  matter). 

In figure 2 we see a flow chart for the basic process
undergone by a test engineer in developing qualification
tests. Current versions of these systems are either human, or
contain  trivial  ‘engines’  that  just  regurgitate  requirements 
typed  in  by  the  test  engineer.  Next  generation  versions  of 
these  systems  are  expected  to  operate  as  an  advisor, 
incorporating  the  expertise  of  the  test  engineer  (user)  into 
the  process. 


In  better  understanding  this  process  and/or  automating  it,  we 
will  need  to  incorporate  the  knowledge  from  expert 
statistical  advisory  systems  into  the  experiment  design 
engine. In addition, expert knowledge about what makes
‘good’ hypotheses for qualification testing must be
incorporated  into  the  hypothesis  engine. 

Not  shown  in  the  diagram  is  the  analysis  portion  of  the 
process  which  would  take  test  data  and  help  to  generate  the 
report.  Each  of  these  components  could  likely  be  derived 
from  existing  work  in  various  forms  of  statistical  expert 
systems  for  design,  analysis  and  inference. 


DOE  in  a  digital  environment 


The application of experimental design is not a
straightforward exercise in the digital non-linear world of
software systems. On the surface, it would appear that we will
run into a combinatorial explosion of necessary test conditions
based on the discrete nature of the system's inputs and
operation. This is not necessarily the case, however.
Fractional factorial designs can be used to reduce the
combinatorial explosion, and aggregation (high level views)
can be used to treat the system as essentially analog and
linearizable.

Consider a software module which has as its primary input a
7 bit integer. If we view the 7 bits as independent two
valued inputs, we can lay out an orthogonal design (2^(7-4) or
a Taguchi L8) giving us an orthogonal cross section of the
input space of the module with respect to the primary input.
Assuming a Pareto effect in faults and no high level
interactions (singularities), we have an efficient set of test
cases for the module. This set is certainly more efficient than
selecting a couple of integer values at extrema or randomly
from the input range, and it requires no more information about
the module than its interfaces.
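
The 7 bit example can be sketched as follows. This is a minimal illustration, assuming the standard construction of an 8-run two-level design for 7 factors in which three bits form a full factorial and the remaining four bits are their XOR (interaction) columns. The choice of which bit plays which role and the function names are assumptions, not taken from the paper.

```python
from itertools import combinations, product


def l8_design():
    """Return the 8 runs of a 2^(7-4) design as tuples of seven 0/1 levels."""
    runs = []
    for a, b, c in product((0, 1), repeat=3):
        # Columns 4-7 are the interaction (XOR) columns of the base factors.
        runs.append((a, b, c, a ^ b, a ^ c, b ^ c, a ^ b ^ c))
    return runs


def to_int(bits):
    """Pack a tuple of bits (most significant first) into an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def is_orthogonal(runs):
    """True if every pair of columns shows all 4 level pairs equally often."""
    ncols = len(runs[0])
    for i, j in combinations(range(ncols), 2):
        counts = {}
        for run in runs:
            key = (run[i], run[j])
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True


if __name__ == "__main__":
    design = l8_design()
    print("orthogonal:", is_orthogonal(design))
    print("test inputs:", [to_int(run) for run in design])
```

The eight integers produced this way cover each bit, and each pair of bits, in a balanced fashion, compared with the 128 values an exhaustive test of the 7 bit input would need.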

Pareto  assumptions  are  reasonable  in  most  software 
development  environments  today.  All  we’re  really  assuming 
here  is  that  the  software  faults  are  not  ’dense’  in  the  code. 
The  absence  of  singularities  is  a  bit  harder  to  deal  with. 
Techniques  such  as  data  dithering  and  data  diversity  are 
available  to  reduce  the  granularity  of  singularities.  The 
problem  cannot  be  entirely  eliminated  though. 

Application  of  experimental  design  can  be  made  reasonably 
obvious  for  some  situations  by  taking  a  high  level  view  of 
the  system.  Many  functions  which  are  implemented  in  an 
embedded digital system (such as navigation) are inherently
analog in nature with only an overlying nonlinear mode
structure. Within any particular mode, the operation of the
system  is  entirely  analog  at  these  high  levels  and 
experimental  design  or  response  surface  methods  are  directly 
applicable. 

Ad  Hoc  Testing 

Testing  to  assure  reliability  of  software  systems  (failure  free 
operation)  must  go  beyond  adherence  to  requirements  to 
address  the  validation  of  the  system  in  an  indeterminate 
environment.  This  immediately  implies  the  application  of 
statistics  to  the  problem.  But  more  than  just  statistical 
procedures are applicable. The entire paradigm of statistical
modeling  and  hypothesis  testing  comes  into  play  in  this 
environment.  This  type  of  testing  is  what  is  called  ad  hoc. 

Requirements  based  testing  can  only  get  at  a  portion  of  the 
aspects  of  a  real  system.  This  is  because,  as  depicted  in 
figure  3,  in  the  real  world,  expectations,  specifications  and 
the  reality  of  implementation  seldom  coincide.  Initially, 
we’re  lucky  if  they  are  not  disjoint. 


In  addition  to  the  basic  problem  of  bringing  user 
expectations  and  specifications  in  line  with  what  is 
physically  realizable,  we  have  additional  difficulties  which 
make  real  world  systems  more  ‘interesting’  than  ideal. 
Usually,  a  user’s  requirements  are  handed  to  the  developer  in 
the  form  of  descriptions  of  responses  to  a  limited  region  of 
the  input  domain  represented  by  the  lines,  disk  and  single 
point  in  figure  4.  Analysis  attempts  to  produce  a  ‘convex’ 
region  which  encompasses  these  initial  requirements.  Design 
then  proceeds  to  move  this  convex  region  toward  a  real 
implementation.  In  the  process  the  highly  non-convex 
region  depicted  results  from  errors  and  physical  constraints. 


_ Figure  4. _ 

Ad hoc testing then is aimed at those regions of the system
shown outside the abstract solution space in figure 4. In this
region lie problems which are due to implementation which
goes beyond requirements and user expectations which don't
show up in the requirements. For example, the well known
ability of an early version of F-16 software to let the
test pilot raise the landing gear on the ground probably
lies near the tip of one of the lobes! Ideally, no such
aspects of the system exist, but as can be seen, this is
effectively impossible if for no other reason than that the
user’s expectations are never constant or clear (since they are
necessarily developed and interpreted by humans).

Iterative  Learning 

The  interactive  ad  hoc  test  process  is  an  iterative  learning 
process  much  like  that  expressed  in  Box,  Hunter  and  Hunter. 
A  test  engineer  begins  with  an  initial  hypothesis  and 
associated model of the system and iteratively hones each
into a clearer understanding of the system. This iterative
process is depicted in figure 5 as a tree structure. Each level
of the process takes the current hypothesis and from it
specifies an appropriate model, designs and executes the
associated experiment and uses the conclusions to refine and
select the hypotheses for the next level. Note that the
outcome of this refinement and selection may be a posterior
distribution  on  the  possible  hypotheses  which  leads  to 
multiple  paths  through  the  tree. 
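
One level of this refinement can be sketched as a Bayes update over a discrete set of candidate hypotheses. The hypothesis labels, the likelihood values, and the branching threshold below are illustrative assumptions, not values from the paper.

```python
def update_posterior(prior, likelihood):
    """Apply Bayes' rule over a discrete set of hypotheses.

    prior:      {hypothesis: P(hypothesis)}
    likelihood: {hypothesis: P(observed test outcome | hypothesis)}
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed outcome is impossible under every hypothesis")
    return {h: w / total for h, w in unnormalized.items()}


def branches(posterior, threshold=0.2):
    """Hypotheses plausible enough to pursue at the next level of the tree."""
    return sorted(h for h, p in posterior.items() if p >= threshold)


if __name__ == "__main__":
    prior = {"H1": 0.5, "H2": 0.5}
    # Assumed probability of the observed experimental outcome under each hypothesis.
    likelihood = {"H1": 0.6, "H2": 0.3}
    posterior = update_posterior(prior, likelihood)
    print(posterior)
    print(branches(posterior))
```

When more than one hypothesis stays above the threshold, the test engineer (or an automated advisor) follows several paths through the tree rather than one.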

In  an  expert  system,  the  inference  engine  takes  an  initial  set 
of  facts  and  proceeds  to  ‘fire’  rules  from  these  facts  to  deduce 
further  facts.  The  primary  difference  in  the  ad  hoc  testing 
process  is  that  the  rule  ‘firing’  is  actually  an  experimental 
process  used  to  derive  the  rules  from  the  real  world  rather 
than a data base.


Preliminary Hypothesis [H0]

H1    H2

H11    H12    H21    H22

Figure 5.

Abstract  system  models 


Many  experienced  test  engineers  can  find  problems  in  a 
system without knowing more than a rudimentary structure
for the system. This suggests that the source of the initial
hypothesis from which general hypotheses and working
models can be derived is a high level abstract model of the
type of system being tested. These experienced test engineers
use (even if they’re not aware of it) a high level abstract
model of the system to guide the selection of hypotheses
without  falling  back  upon  requirements.  These  models  are  a 
combination  of  operational  experience  and  experience  with 
the  kinds  of  things  which  can  and  do  go  wrong  in  the 
development  process  and  in  real  systems.  Development  of 
these  abstract  models  can  benefit  from  work  in  extracting 
semantic  information  from  text. 


In order to automate this process or improve our own
capability of developing systems we need to understand and
be able to produce this abstract model. Most test engineers
don't even realize they are working within this paradigm,
much less are able to transfer the knowledge of the model to
an expert system. Alternatively, we can derive the abstract
model from existing software systems. This is where current
work in semantic recognition comes in. We can think of a
program as a living book. The basic story or class of stories
is fixed, but the details of the current instantiation of the
story depend upon the data fed to the program.

Statistical analysis of the digraphs which represent the
programs along with the variable and procedure names
(assuming they’re done with reasonable mnemonics) can
help us to develop the kind of basic ‘story’ lines that are
most often used in various classes of software. In these story
lines we have an abstract view of the underlying abstraction
which provides the basis for the software.
Interfacing Physiologically-based Pharmacokinetic Modeling and Simulation Systems
Derek  B.  Janszen  and  M.C.  Miller,  III  Biostatistics,  Epidemiology  &  Systems  Science 
Medical  University  of  South  Carolina,  Charleston,  SC  29425 

ABSTRACT

The graphical user interface of a physiologically-based
pharmacokinetic (PB PK) modeling and simulation system
developed for the Macintosh™ computer is described. The
user interactively specifies: 1) the anatomical structure of the
model (tissues) and the anatomical structure of each tissue;
2) physiological relationships; 3) transport characteristics;
4) thermodynamic properties of the substance.

The interface utilizes four independent interactive windows:
Model, Parameter, Kinetics, and Solution. The user
selects tissues for the model and an exposure route from a
flow diagram consisting of nine different tissues and four
possible routes of exposure, or from a menu. Assumptions
limiting the rate of mass transfer can be specified for each
tissue. Parameters for each tissue, as well as dosage
parameters, are entered via dialog boxes. This method of
specifying the model parameters encourages "What if...?"
scenarios. The model is cast in an S-system format for ease
of solution and for added flexibility in simulating inherently
nonlinear biological systems. The system generates a
steady-state solution, which can be plotted as multiple tissue
concentration-time curves on a configurable graph. The data
files can be exported to other graphics and statistics
packages. The pictorial flow diagram, a table of all tissue
parameter values, the steady-state solution set, and the
graphic plots can be printed.

INTRODUCTION

Physiologically-based pharmacokinetic (PB PK) models
utilize a system of lumped compartments which are designed
on the basis of the actual anatomy and physiology of the
species. Model parameters fall into four broad categories:
1) anatomical, e.g., organ volumes and tissue sizes;
2) physiological, e.g., blood flow rates and enzyme
reaction rates; 3) thermodynamic, e.g., drug-protein
binding isotherms; and 4) transport, e.g., membrane
permeabilities (Himmelstein and Lutz, 1979).

The first step in the development of a PB model is to
select the number and type of tissues. Once the tissues are
selected, a flow scheme is drawn with the desired regions
describing the species anatomically (Figure 1). The liver, gut,
spleen, and pancreas (enterohepatic system) are interconnected
anatomically, maintaining the physiological basis.
Each tissue is initially considered to consist of three
homogeneous subspaces: (a) a vascular space through
which the tissue is perfused with blood; (b) an interstitial
space, which forms a matrix for the tissue cells; and (c) an
intracellular space consisting of the tissue cells that
comprise the organ (Figure 2).

Rate-limiting assumptions may simplify the 3-subspace
model to one or two subspaces. The flow-limited model
has a single space and is used to model tissues that are not
well perfused by the circulatory system. The membrane-limited
model assumes that the transfer across the cell
membrane is rate limiting, and thus reduces the tissue model
to 2 subspaces. Our system admits any of the above three
configurations for a tissue. The system also permits the
specification of four physiological processes which can
affect the distribution and flux of a substance: transport
across a membrane, binding, excretion, and metabolism.
A linear or non-linear formulation can be used for
modeling these processes, depending on the available
information for a given process. Generally, a PB model may
include any or all of these processes.

IMPLEMENTATION OF PK MODELS

Although the PB approach to PK modeling is the method
of choice, there is, among some, reluctance to use this
approach because of the mathematics associated with the
method (D'Souza and Boxenbaum, 1988). Nevertheless,
progress in the development of computer software for solving
the system of differential equations generated by these
models is being reported.

The literature describes three modeling methods. They
include utilization: 1) of a fixed model simulator where the
number and/or types of tissues are fixed (Bloch et al., 1980;
Gabrielsson and Hakman, 1986; Menzel et al., 1987); 2) of
a general simulation system (Blau and Neely, 1987); and
3) of spreadsheet-based simulators (Ball et al., 1985;
Johanson and Näslund, 1988).

At the heart of these modeling methods are the algorithms
used to solve the system of differential equations. Since
algebraic solutions are not available for these complex
models, they must be approximated by numerical methods.
Two terms used to characterize these numerical methods are
accuracy and efficiency: by accuracy is meant the error
(difference) between the numerical solution and the true
solution; by efficiency is meant the "cost" of the solution
in terms of convergence of the estimation procedure, which
is generally equated to computer time. Some of these
methods are very simple, easy to program, and are efficient;
their disadvantage is that they do not give very accurate
results. Other methods, while achieving better accuracy, are
more difficult to program and are less efficient.

The modeling system described in this paper utilizes the
Power-Law Formalism (Savageau, 1969; Voit, 1991). It
admits several system-modeling strategies. Table A is a
mathematical representation of a generalized tissue with
three subspaces, denoted by the subscripts 1, 2, and 3. This
model expresses changes in mass in terms of blood and
tissue concentrations, and general flux and biotransformation
terms. In this representation, biotransformation
(metabolism/excretion) can only occur in the "cellular" subspace.
This representation also permits this set of equations to be
used for modeling flow-limited and membrane-limited
configurations by setting appropriate terms equal to zero.
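
A minimal sketch of a three-subspace tissue mass balance of the kind Table A describes, integrated with a classical fourth-order Runge-Kutta step. Everything specific here is an assumption made for illustration and not the authors' formulation: the passive linear flux terms, the first-order biotransformation term in the cellular subspace, the constant plasma concentration, the placement of the partition coefficient R, and all parameter values and names. Setting the flux coefficient k12 to zero collapses the model to a single flow-limited space.

```python
def tissue_rhs(c, p):
    """Rates of change of the three subspace concentrations (C1, C2, C3)."""
    c1, c2, c3 = c
    eta1 = p["k12"] * (c1 - c2)  # vascular -> interstitial flux (assumed passive, linear)
    eta2 = p["k23"] * (c2 - c3)  # interstitial -> cellular flux (assumed passive, linear)
    tau = p["km"] * c3           # biotransformation, cellular subspace only (assumed first order)
    return (
        (p["Q"] * (p["Cp"] - c1 / p["R"]) - eta1) / p["V1"],
        (eta1 - eta2) / p["V2"],
        (eta2 - tau) / p["V3"],
    )


def rk4_step(c, h, p):
    """Advance the state by one classical Runge-Kutta step of size h."""
    def shifted(k, scale):
        return tuple(x + scale * dk for x, dk in zip(c, k))

    k1 = tissue_rhs(c, p)
    k2 = tissue_rhs(shifted(k1, h / 2), p)
    k3 = tissue_rhs(shifted(k2, h / 2), p)
    k4 = tissue_rhs(shifted(k3, h), p)
    return tuple(
        x + h / 6 * (a + 2 * b + 2 * d + e)
        for x, a, b, d, e in zip(c, k1, k2, k3, k4)
    )


def simulate(p, c0=(0.0, 0.0, 0.0), h=0.01, steps=5000):
    """Integrate from c0 for steps * h time units and return the final state."""
    c = c0
    for _ in range(steps):
        c = rk4_step(c, h, p)
    return c


# Hypothetical parameter values, chosen only to make the example run.
PARAMS = {"Q": 1.0, "Cp": 1.0, "R": 2.0, "V1": 1.0, "V2": 1.0, "V3": 1.0,
          "k12": 0.5, "k23": 0.5, "km": 0.1}

if __name__ == "__main__":
    print("three subspaces:", simulate(PARAMS))
    print("flow-limited:   ", simulate(dict(PARAMS, k12=0.0)))
```

A simpler integrator such as Euler's method would be cheaper per step but less accurate for the same step size, which is the accuracy versus efficiency trade-off discussed above.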




Table B describes the characteristics of the possible tissue
configurations admitted by the generalized 3-subspace model.
The number of subspaces, presence or absence of
biotransformation and flux terms, type of flux (ACTive or
PASsive), and type of biotransformation (LINear or
Michaelis-Menten) are specifically enumerated.

Table C is the S-system representation corresponding to
the linear system of Table A. This set of three S-system
equations can describe all possible configurations for a
3-subspace tissue. In the S-system approach the different
configurations are admitted by altering the values of the
parameters according to the rate laws that are in effect for a
particular configuration.
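
The general S-system form, in which each rate is one power-law production term minus one power-law degradation term, can be sketched as below. The variable ordering (plasma first, then subspaces 1 to 3), the kinetic orders, and the rate constants are illustrative assumptions and are not the entries of Table C.

```python
def s_system_rhs(x, alpha, g, beta, h):
    """Evaluate dX_i/dt = alpha_i * prod_j X_j**g[i][j] - beta_i * prod_j X_j**h[i][j].

    x:      all variables; here x[0] is plasma (held fixed), x[1:] the subspaces
    alpha:  production rate constants, one per dependent variable
    beta:   degradation rate constants, one per dependent variable
    g, h:   kinetic orders, one row per dependent variable, one column per entry of x
    """
    rates = []
    for i in range(len(alpha)):
        production = alpha[i]
        degradation = beta[i]
        for j, xj in enumerate(x):
            production *= xj ** g[i][j]
            degradation *= xj ** h[i][j]
        rates.append(production - degradation)
    return rates


# Hypothetical parameters: a kinetic order of 0 removes a variable's influence.
ALPHA = [1.0, 0.5, 0.25]
BETA = [1.0, 0.5, 0.25]
G = [[1, 0, 1, 0],   # subspace 1 is fed from plasma and subspace 2
     [0, 1, 0, 1],   # subspace 2 is fed from subspaces 1 and 3
     [0, 0, 1, 0]]   # subspace 3 is fed from subspace 2
H = [[0, 1, 0, 0],   # each subspace loses mass in proportion to its own level
     [0, 0, 1, 0],
     [0, 0, 0, 1]]

if __name__ == "__main__":
    print(s_system_rhs([2.0, 1.0, 1.0, 1.0], ALPHA, G, BETA, H))
```

Changing a configuration then amounts to changing numbers in ALPHA, BETA, G, and H rather than rewriting the equations, which is the flexibility the S-system representation is meant to provide.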

Our "PB-PK" system is a flexible and generic PB PK
modeling and simulation system developed for the
Macintosh™ computer. The user interactively specifies:
1) the anatomical structure of the model
(tissues) and the anatomical structure of each tissue (i.e., the
parameters of the vascular, interstitial, and intracellular
subspaces); 2) physiological relationships (blood flow rates for
each tissue, metabolism and excretion of the substance);
3) transport characteristics, which also entails identification
of flow- and membrane-limitations; and 4) thermodynamic
properties of the substance (tissue partition coefficients).

The  graphical  user  interface  closely  adheres  to  the  human 
interface  guidelines  proposed  by  Apple  Computer  (1987). 

The application has four independent interactive windows:
Model, Parameter, Kinetics, and Solution.
The  content  of  each  window  can  be  printed,  and  the  model 
(including  parameters)  and  simulation  data  (sim-data)  saved 
independently  as  files.  The  sim-data  file  format  allows  it  to 
be  exported  to  other  graphics  and/or  statistical  applications. 

The user defines the anatomical model in the Model
window (see Figure 1). This requires selecting, on a flow
diagram, a subset of the nine different tissues
identified in the window: lung; heart; liver; gut; spleen;
kidney; muscle; testes; and "other". There are four possible
routes of exposure: intravenous (IV), intramuscular (IM),
oral, and inhalation.

Parameters  for  the  tissues  are  entered  by  means  of  dialog 
boxes (Figure 3). The user chooses the tissue configuration,
depending on rate-limiting assumptions. The number of
parameters  to  be  specified  in  the  dialog  box  is  a  function  of 
this  selection.  An  array  showing  the  values  of  all  the  model 
parameters  is  displayed  in  the  Parameter  window. 

Exposure route parameters are also entered via a dialog box.
The dosage regimen (Figure 4) admits a bolus or continuous
dose, with the user able to specify the time at which the
dosing occurs, as well as the fraction, F, that is absorbed
into the blood. Because of the modular format used in the
development of this software, it will be possible to
incorporate more complicated dosing regimens (e.g., the
universal elementary dosing regimen (Sebalt and Kreeft, 1987)).

The Kinetics window (Figure 4) displays the results of
the simulation once the model has been selected and the
parameters entered. Dialog boxes are linked to this window
to allow for configuration of the graph (time in hours/days,
selection of which tissues or metabolites to graph, etc.),
plotting of experimental data, and for any other parameters
needed for solution of the set of differential equations.

The  Solution  window  displays  the  resulting  simulation 
data  in  a  columnar  format.  The  user  can  specify  the 
frequency with which the time points are displayed (e.g.,
every  sixth  time  point). 

The set of differential equations generated by the selection
and specification of tissues is solved by incorporating the
necessary modules from ESSYNS™, an interactive program
written for the analysis of mathematical models expressed in
S-system form (Irvine and Savageau, 1990; Voit et al., 1989).

DISCUSSION 

As  in  all  simulation  systems,  our  modeling  system  is 
dependent  on  external  estimation  of  PK  parameters  used  in 
the model. These estimates may be derived from the
literature, the investigator's previous experience, or classical
parameter estimation experiments, or may reflect a hypothesized
value. Although many physiological parameters are
available in the literature, others, such as binding constants,
frequently are not. When experimentation is not possible in
humans  the  investigator  must  rely  on  in  vitro  or  animal 
studies. 

PB PK models are attractive for a number of reasons. First
and foremost they are physiologically and anatomically
correct. Second, they admit non-linear relationships. Third,
they may be cast in the form of S-systems, thus making
them mathematically tractable. Fourth, these systems may
be easily modeled using our system. Finally, these models
may be used to visually describe system dynamics and status
through the graphical user interface. The classical approach
to PK modeling relates dose and plasma concentration. The
physiological approach goes one step further to relate dose,
plasma, and tissue concentrations (Ritschel and Banerjee,
1986). Furthermore, it is adaptable to changing physiological
circumstances and can allow for species-to-species and even
subject-to-subject differences within the context of the
physiological or anatomical parameters in the model
(Himmelstein and Lutz, 1979). Perturbation of a particular
parameter allows one to predict the changes in distribution
or disposition of the drug during disease states, for instance,
or in the presence of another drug. The combined effect of a
number of complex inter-related processes can also be
determined provided sufficient data are available (Ritschel and
Banerjee, 1986).

SUMMARY 

Physiologically-based pharmacokinetic modeling is
rapidly gaining acceptance as a method for simulating tissue
drug concentrations based on anatomical and physiological
parameters and thermodynamic properties of the drug.
Currently available software systems that use the
physiologically-based philosophy are limited by the assumption of
a particular type of physiologically-based model. Using a
simulation language to define a complex model can be
tedious. The Janszen-Miller "PB-PK" system is an interactive
generic physiologically-based pharmacokinetic modeling
and simulation system wherein specification and
modification of the model is facilitated by the graphical user
interface of the Macintosh™ computer. It allows great flexibility
in specifying a model, as well as ease of specifying the
model parameters, and encourages "What if...?" scenarios.
The user selects tissues for the model and an exposure route
from an anatomical flow diagram or from a menu.
Assumptions limiting the rate of mass transfer can be
specified for each tissue. Parameters for each tissue, as well
as dosage parameters, are entered via dialog boxes. The
model is cast in an S-system format for ease of solution and
for added flexibility in simulating inherently nonlinear
biological systems. The system generates a steady-state
solution, which can be plotted as multiple tissue
concentration-time curves on a configurable graph. The system
allows one to examine concurrent concentrations of a substance
and its metabolite(s) within vascular, interstitial, and cellular
components of a single tissue or organ; plot these values over
time in the presence of single or repeated dosing; plot
experimental data; and generate data files for export to other
graphics and statistics packages. The pictorial flow diagram,
a table of all tissue parameter values, the steady-state
solution set, and the graphic plots can be printed.
Figure 1. Flow scheme of a generic PB PK model

Figure 2. Generalized 3-subspace tissue


$V_1 \dot{C}_1 = Q\,(C_p - C_1/R) - \eta_1$

$V_2 \dot{C}_2 = \eta_1 - \eta_2$

$V_3 \dot{C}_3 = \eta_2 - \tau$

Table A. Mathematical representation for a
general 3-subspace tissue ($\dot{C}$ = dC/dt, V =
volume, R = partition coefficient, $\eta$ = general
flux term, $\tau$ = general biotransformation term,
p = plasma, 1,2,3 = subspace)
Table B. Enumeration of biological processes for all
possible configurations of a tissue (see text for details;
a dash indicates a process does not occur)
Figure 3. Tissue parameter dialog box


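$\dot{X}_1 = \alpha_1 X_p^{g_{1p}} X_2^{g_{12}} - \beta_1 X_1^{h_{11}}$

$\dot{X}_2 = \alpha_2 X_1^{g_{21}} X_3^{g_{23}} - \beta_2 X_2^{h_{22}}$

$\dot{X}_3 = \alpha_3 X_2^{g_{32}} - \beta_3 X_3^{h_{33}}$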

Table C. S-system representation for a
general 3-subspace tissue ($g_{ij}$, $h_{ij}$ =
kinetic order for all processes from jth
space to ith space, $\alpha$, $\beta$ = rate constants,
p = plasma, 1,2,3 = subspace)

The computer technology of the 1990’s
will provide many opportunities to improve
productivity. A company’s strategy to integrate
its existing technology with emerging
technology will determine how well it takes
advantage of those opportunities. The goal of this
paper is to discuss the primary factors that
will affect productivity within the
computing environment. The discussion will
center on coping with existing technologies,
computing innovations, automation platforms
and contemporary management issues.


COPING  WITH  AND  CHANGING 
OBSOLETE  SYSTEMS 


Figure  1 


Coping  with  and  changing  obsolete 
systems. 

One  of  the  pressing  challenges  for 
management  will  be  integrating  existing 
systems  with  new  technology.  Many 
current  applications  have  been  developed 
in-house  using  methodologies  that  are  now 
obsolete.  For  example,  databases  may  have 
been  developed  when  there  was  little  or  no 
formal  database  management  system 
available.  Failure  to  keep  up  with  current 
technology  through  capital  investment  and 
continued  education  can  lead  to  aging 
home-grown  systems,  housed  in  a 
collection  of  primitive  hardware. 

Perhaps  one  of  the  major  obstacles  to 
introducing  new  computing  technology 
remains  ineffective  communication. 
Systems  managers  as  well  as  users  have 
communications  responsibilities.  Systems 
managers  should  be  well  informed  of 
changes  in  computing  technology  and 
inform  end  users  how  it  may  benefit  them. 
Users  need  to  take  the  initiative  to  clearly 
define  their  application  needs.  In 
cooperation,  these  two  groups  can  develop 
appropriate  strategies  to  mend,  change  or 
replace  existing  systems  with  new 
technology.  Additional  communication  is 
needed  with  the  general  user  community 
so  they  understand  how  change  will 
benefit  them.  Understanding  the  corporate 
culture  and  traditions  will  facilitate  these 
communications.  Figure  1,  above, 
highlights  the  essential  components  for 
effective  transition  from  obsolete  systems 
to  a  new  technology. 


Computer  Enhanced  Productivity  1 19 


Computing  Innovation. 

Computing  innovations  in  hardware, 
software  and  communication  have 
revolutionized  the  way  we  process  information. 
The  development  of  fast  computer  chips  and 
processors,  the  advent  of  new  and  flexible 
operating  systems,  and  improvements  to  data 
communication  provide  greatly  enhanced 
computing  opportunities.  Figure  2,  below, 
illustrates  some  of  the  features  of  these 
emerging  technologies. 


The  challenge  is  to  effectively  employ  these 
innovations  to  facilitate  rapid  application 
development,  data  access,  faster  information 
transfer,  resource  sharing,  enhanced  computer 
performance  and  automated  processing  for  the 
benefits  of  computer  users. 

Productivity enhancement often requires large initial investments
of time and capital. It is imperative that senior management
understand the costs and benefits of computer enhanced
productivity improvements and provide adequate funding.

Productivity  in  clinical  information 
processing  -  an  example. 

Productivity  enhancement  in  clinical 
information  processing  will  involve  both 


automation  and  data  communication.  The 
major  factors  (see  Figure  3)  that  will 
impact  these  two  areas  are:  the  evolution 
of  Integrated  Services  Digital  Network  (a 
technology  that  integrates  data,  voice  and 
graphic  information  on  digital  lines), 
integration  of  remote  and  central 
processing  capabilities  and  development  of 
systems  tolerant  of  different  languages, 
software  and  hardware. 

An  example  of  productivity 
enhancements  in  clinical  information 
processing  is  the  development  of  the 
concept  of  remote  study  monitoring  (RSM) 
and  evolution  of  computing  systems  to 
support  it.  Traditional  clinical  information 
processing  often  involves  a  collection  of 
remote  site  investigators  who  treat 
patients  and  fill  out  forms  that  describe 
their  medical  history  and  responses  to 
therapy.  These  forms  are  usually  collected 
or  mailed  to  a  central  site  for  data  entry, 
data  editing  and  study  conduct  monitoring. 
Any  discrepancies  are  mailed  or 
telephoned  back  to  the  remote  investigator 
site  for  resolution.  This  process  is  often 
complicated  and  time  consuming.  RSM 
technology  has  been  developed  so  that  data 
entry,  editing,  review  and  clean  up  can  be 
done  at  the  remote  site  in  a  very  user- 
friendly  manner.  Data  from  the  remote 
site  can  be  automatically  transferred  to  the 
central  site  overnight  using  modems  and 
telephone  lines.  Study  monitors  at  the 
central  site  can  review  the  data  and 
communicate  with  remote  sites  via 
electronic  mail.  In  principle,  such  systems 
can  eliminate  some  of  the  complications, 
reduce  data  errors  and  time  delays  in 
traditional  clinical  information  processing 
systems.  Implementation  of  these  systems 
may  involve  all  the  factors  impacting 
productivity  and  automation  mentioned  in 
Figure  3. 

Other  examples  of  productivity 
enhancing  tools  in  clinical  information 
processing  include  digital  imaging  and 
electronic  note  pad  technologies.  The  first 
could  be  used  to  electronically  convert 
documents to digital data. The second
could be used to directly enter clinical data on
hand-held  electronic  note  pads  without  the  use 
of  paper  forms.  Introduction  of  automated 
data  processing  should  be  a  joint  responsibility 
between  systems  managers  and  computer 
users  as  discussed  above. 


Contemporary  Management  Challenges. 

The  advent  of  the  personal  computer  has 
dramatically  changed  the  computer  side  of  the 
work  place  for  the  knowledge  workers.  The 
human  side  of  that  work  place  has  also 
changed.  For  example,  there  is  more  cultural 
diversity  in  offices  today  than  there  was  ten 
years  ago.  Experts  predict  that  this  trend  will 
continue  into  the  future.  Senior  and  middle 
managers  must  face  the  challenge  of 
effectively  managing  groups  of  workers  with 
different  skills,  backgrounds  and  motivations. 
Specifically,  managers  and  supervisors  need  to 
stimulate  and  sustain  motivation  on  the  job, 
provide  exciting  career  potential  for  their 
workers  and  in  general  provide  a  work  place 
conducive  to  productivity  enhancements.  It  is 
unfortunate  that  many  organizations  spend 
tens  of  thousands  of  dollars  recruiting  talented 
workers  only  to  provide  little  or  no  challenge 
for  such  workers.  It  is  encouraging  to  note 
that  creative  benefits  are  being  introduced  in 
the  work  place.  Flexible  work  hours  and 
educational  assistance  are  two  of  these 


benefits.  Figure  4,  below,  illustrates 
commonly  introduced  programs. 


Conclusion. 

The  growth  in  technology  and 
emphasis  on  productivity  will  put  a 
tremendous  amount  of  pressure  on  the 
knowledge  workers  of  the  1990’s.  Unless 
properly  managed,  the  result  could  be 
excessive  stress  and  burnout,  causing  a 
decrease  in  productivity  instead  of  an 
increase,  lack  of  job  satisfaction  instead  of 
sustained  motivation  and  poor 
communication  among  peers  and  between 
supervisors  and  their  subordinates.  As  the 
authors  have  described,  successful 
implementation  of  new  technologies  to 
improve  productivity  requires  clear 
understanding  of  existing  systems  and 
corporate  culture,  firm  grasp  of  the 
benefits  of  new  technologies,  careful 
transition  planning  among  systems 
managers  and  users,  and  management 
appreciation  of  the  special  needs  of  the 
knowledge workers.

A number of methods have been suggested for robustly estimating a
linear discriminant function. These include substitution of robust
estimates for the mean and covariance matrix and methods which
choose a projection to maximize a robust measure of separation.
This paper presents results of Monte Carlo simulations comparing
some of these methods along with various modifications to see
whether relatively simple methods work as well as complicated
ones.

Introduction 

For two-population discriminant analysis, if $f_1(\cdot)$ and
$f_2(\cdot)$ are the density functions of the underlying
populations and we assume equal costs and equal priors, then

$$\frac{f_1(x)}{f_2(x)} \ge 1$$

gives the optimal discrimination rule.

If we further assume that $f_1(x)$ and $f_2(x)$ are normal with a
common covariance matrix, then we have

$$\beta' x + c \ge 0$$

where

$$\beta = \Sigma^{-1}(\mu_1 - \mu_2), \qquad
c = -\tfrac{1}{2}(\mu_1 + \mu_2)'\,\Sigma^{-1}(\mu_1 - \mu_2)$$

In practice we use the sample mean $\bar{x}$ and pooled sample
covariance matrix $S$ for $\mu$ and $\Sigma$, and this gives the
linear discriminant function (LDF), which is widely used in
practice. But the LDF is not robust to violations of the
normality assumptions (Lachenbruch et al. 1973).

There are several approaches to deal with this problem:

1. Use nonparametric density estimates of $f_1(x)$, $f_2(x)$
(Koffler et al. 1978).

2. Transformation: e.g., rank transformation, normal score
transformation (Conover & Iman 1980, Koffler et al. 1982).

3. Robust estimates of $\mu_1$, $\mu_2$, $\Sigma$ in the LDF: e.g.,
Huber-type M-estimates (Randles et al. 1978).

4. Estimate $\beta$ and $c$ directly to obtain an optimal
projection: e.g., nonmetric discriminant analysis (NDA)
(Raveh 1989).

Of these four procedures, nonparametric density estimates (#1)
require large sample sizes and the algorithm is complicated; the
disadvantage of transformations (#2) is that each time a new
observation is classified, it is necessary to go back and find the
rank or normal score of this new observation; projection methods
(#4) are very difficult for more than a few variables. Robust
substitution procedures (#3) are relatively simple and easy to
compute and are the focus of this study.

The original purposes of this study were to

1. Compare the effect of using different robust estimates of
location and scale in the LDF on misclassification rates.

2. Compare the variability of misclassification rates under
different procedures.

However, the results for objective 1 indicated that a very simple
procedure, which we call MLDF, worked about as well as any of the
other estimation procedures. Therefore, we added a third
objective: compare the MLDF to other procedures of all types in
the literature.

Some Results from Simulation

We used the following robust estimates of covariance in this
study:

$$\mathrm{Cov}(X_i, X_j) = R(X_i, X_j)\,\mathrm{MAD}(X_i)\,\mathrm{MAD}(X_j)$$


where $R(X_i, X_j)$ is Pearson's $r$, Kendall's $\tau$, Spearman's
$\rho$, or the greatest deviation correlation coefficient $R_g$
(Gideon & Hollister 1987), and $\mathrm{MAD}(X_i)$ is the median
absolute deviation. A Huber-type M-estimate for covariance was
also used. Two robust estimates of location were used in addition
to the mean: the median and a Huber-type M-estimate (Randles et
al. 1978). We substituted these estimates of location and scale in
the LDF. In the simulation we considered only bivariate
distributions, that is, $p = 2$. The distributional situations
were normal, lognormal, mixture normal and bivariate Cauchy
distributions. We found that for all these situations, the
estimate of the covariance matrix had little effect on
misclassification rates, at least with the estimates we used. The
median worked as well as M-estimates for location, and both were
better than the mean.
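The covariance construction described above, a correlation coefficient multiplied by the two median absolute deviations, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: it uses Spearman's coefficient for R and the unscaled MAD, handles the bivariate case only, and the function names (`mad`, `ranks`, `spearman`, `robust_cov`) are hypothetical.

```python
import statistics


def mad(xs):
    """Median absolute deviation about the median (unscaled)."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs])


def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(xs, ys):
    """Ordinary product-moment correlation (assumes non-constant inputs)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spearman(xs, ys):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def robust_cov(x1, x2, corr=spearman):
    """2x2 matrix with entries R(Xi, Xj) * MAD(Xi) * MAD(Xj)."""
    m1, m2 = mad(x1), mad(x2)
    r = corr(x1, x2)
    return [[m1 * m1, r * m1 * m2],
            [r * m1 * m2, m2 * m2]]
```

Passing `corr=pearson` gives the Pearson variant; a single gross outlier in `x1` changes the MAD-based diagonal entries very little, which is the point of the construction.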

Results for the lognormal, mixture normal and Cauchy
distributions are reported in Table 1 and are representative
of all the results.

For  mixture  normal  situation,  the  two  populations  were: 


:0.9* 


-.0.9* 


Lognormal distributions were generated from independent normals
with unit variance and means (3,0) and (0,0). Bivariate Cauchy
random variables were generated by the transformation
$Y = Z/\sqrt{S} + \mu$, where $Z$ is multivariate normal with mean
0 and covariance $\Sigma$, and $S$ has a $\chi^2$ distribution with
1 degree of freedom (Johnson 1987). The underlying normal
distributions used to generate the Cauchy distributions were:


A  rank  cutoff  point  was  used  instead  of  a  zero  cutoff 
point  in  LDF  based  on  Randles  et  al.  (1978). 
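The bivariate Cauchy construction described above, a normal vector divided by the square root of an independent chi-square variable with 1 degree of freedom and then shifted, can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name `bivariate_cauchy` is hypothetical, the location and covariance in the example call are placeholders rather than the values used in the study, and the chi-square variate is obtained by squaring a standard normal.

```python
import math
import random
import statistics


def bivariate_cauchy(mu, sigma, rng):
    """Draw one point Y = Z / sqrt(S) + mu.

    Z is bivariate normal with mean 0 and 2x2 covariance `sigma`;
    S is chi-square with 1 degree of freedom, independent of Z.
    """
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    z1 = l11 * e1
    z2 = l21 * e1 + l22 * e2
    # The square of a standard normal is chi-square with 1 degree of freedom.
    root_s = abs(rng.gauss(0.0, 1.0))
    return (z1 / root_s + mu[0], z2 / root_s + mu[1])


# Placeholder parameters, chosen only for illustration.
rng = random.Random(0)
sample = [bivariate_cauchy((3.0, 0.0), [[1.0, 0.0], [0.0, 1.0]], rng)
          for _ in range(1000)]
center = (statistics.median(p[0] for p in sample),
          statistics.median(p[1] for p in sample))
```

Because the Cauchy distribution has no mean, the coordinatewise median is used above to summarize the center of the sample.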

Suggestion  and  Some  Comparisons 

Based  on  the  results  above  we  chose  the  following  proce¬ 
dures  for  further  study: 

(1) MLDF procedure: Substitute the median vector for the mean
vector in the LDF while still using $S$ for $\Sigma$ and with zero
cutoff.

(2) RMLDF procedure: Substitute the median vector for the mean
vector in the LDF while still using $S$ for $\Sigma$ but with rank
cutoff (Randles et al. 1978). The rank cutoff point is used to
balance the misclassification rates between the two groups. We
chose as the cutoff a point such that the proportions of
misclassified observations in the two groups, as determined by
the discriminant function scores, were as equal as
possible.
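The two procedures can be sketched for the bivariate case as follows. This is an illustrative reading of the description above, not the authors' implementation: the names `mldf`, `score` and `rank_cutoff` are hypothetical, and the cutoff search (trying midpoints between adjacent pooled scores and breaking ties by the smaller total error) is an assumption about how "as equal as possible" might be made operational.

```python
import statistics


def _scatter(group):
    """Sums of squares and cross-products about the group mean."""
    n = len(group)
    mx = sum(p[0] for p in group) / n
    my = sum(p[1] for p in group) / n
    sxx = sum((p[0] - mx) ** 2 for p in group)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in group)
    syy = sum((p[1] - my) ** 2 for p in group)
    return sxx, sxy, syy


def mldf(g1, g2):
    """LDF coefficients (beta, c) with medians in place of means.

    The pooled sample covariance S is still the ordinary one.
    """
    a, b = _scatter(g1), _scatter(g2)
    dof = len(g1) + len(g2) - 2
    sxx, sxy, syy = ((a[k] + b[k]) / dof for k in range(3))
    m1 = (statistics.median(p[0] for p in g1), statistics.median(p[1] for p in g1))
    m2 = (statistics.median(p[0] for p in g2), statistics.median(p[1] for p in g2))
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    det = sxx * syy - sxy * sxy
    # beta = S^{-1} (m1 - m2) using the explicit 2x2 inverse.
    beta = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    c = -0.5 * (beta[0] * (m1[0] + m2[0]) + beta[1] * (m1[1] + m2[1]))
    return beta, c


def score(beta, c, point):
    """Discriminant score; MLDF assigns to population 1 when score >= 0."""
    return beta[0] * point[0] + beta[1] * point[1] + c


def rank_cutoff(scores1, scores2):
    """Cutoff that makes the two groups' error proportions most nearly equal.

    Group 1 is misclassified when its score falls below the cutoff,
    group 2 when its score is at or above it.
    """
    pooled = sorted(set(scores1) | set(scores2))
    candidates = [pooled[0] - 1.0]
    candidates += [(lo + hi) / 2 for lo, hi in zip(pooled, pooled[1:])]
    candidates.append(pooled[-1] + 1.0)
    best_key, best_cut = None, None
    for cut in candidates:
        miss1 = sum(s < cut for s in scores1) / len(scores1)
        miss2 = sum(s >= cut for s in scores2) / len(scores2)
        key = (abs(miss1 - miss2), miss1 + miss2)
        if best_key is None or key < best_key:
            best_key, best_cut = key, cut
    return best_cut
```

For RMLDF under this reading, the scores of both training groups are computed with `score` and the zero cutoff is replaced by the value returned from `rank_cutoff`.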


Table 1. Average Percentages Misclassified Using
Different Location and Scale Estimators in LDF

                                  Location Estimator
Covariance
Estimator    Distribution    Mean    Median    Huber-type M-estimate
Pearson      MN(a)           39.0    33.7      33.3
             LN(b)           12.4    13.4      12.0
             C(c)            41.6    34.8      35.7
Spearman     MN              39.2    32.5      31.9
             LN              11.8    12.0      11.5
             C               42.9    33.0      34.1
Kendall      MN              39.8    32.8      32.1
             LN              11.0    11.7      10.9
             C               42.5    33.0      33.8
Rg           MN              40.0    33.0      32.1
             LN              11.2    12.0      11.1
             C               42.7    33.0      33.6
Huber        MN              -       -         31.9
             LN              -       -         11.9
             C               -       -         34.7

(a) Mixture Normal
(b) Lognormal
(c) Cauchy

We  compared  these  two  procedures  with  several 
published  studies. 

(1) A Comparison with the Randles et al. (1978) Study

Randles et al. (1978) introduced a generalization of the LDF, the
R$\tau$H procedure, and also an LDF procedure with Huber-type
M-estimates and a rank cutoff, RLH. For the R$\tau$H procedure,
they took a nondecreasing, bounded odd function $\tau$ as a
measure of separation and found the direction which maximizes
this measure. They considered the distributional situations in
Table 2. To their results, which are given in Table 3, we have
added results for MLDF and RMLDF from a new simulation (with
different random numbers). We see from this table that the RMLDF
procedure works as well as the more complicated R$\tau$H
procedure. In particular, consider distributional situation 8,
where the distributions were contaminated not only by changing
the standard deviations but also by changing the mean. The mean
is a relatively bad estimate of location, but the median is not
much affected by the
outliers and thus produced relatively good estimates.

Table 2. Distributional Situations

                              Population 1                     Population 2
Situation  Distribution  mu1    mu2    sigma1  sigma2     mu1    mu2    sigma1  sigma2
1          N(a)          0      0      1       1          1      1      1       1
2          N             0      0      1       1          1.78   1.78   2       3
3          LN(b)         1.65   1.65   1       1          2.65   2.65   1       1
4          LN            1.65   1.65   1       1          3.43   3.43   2       3
5          MN(c)         0      0      2       1          2.01   0      2       1
                         0      0      20      10         2.01   0      20      10
6          MN            0      0      2       1          3.19   0      4       3
                         0      0      20      10         3.19   0      40      30
7          MN            0      0      2       1          2.01   0      2       1
                         2.01   0      20      10         0      0      20      10
8          MN            0      0      2       1          3.19   0      4       3
                         3.19   0      40      30         0      0      20      10

(a) Normal
(b) Lognormal
(c) Mixture Normal (0.9 of first, 0.1 of second)

Table 3. Empirical Percentages Misclassified when n1 = n2 = 30

Situation    LDF      RLH      R$\tau$H   RMLDF
1            29 29    29 29    29 29      28 30
2            17 33    27 28    28 28      24 29
3            22 31    26 26    26 26      26 27
4            14 40    25 26    26 26      26 27
5            40 35    33 29    34 30      33 32
6            23 44    30 31    33 34      30 33
7            40 39    33 31    36 32      34 33
8            41 37    30 31    32 33      30 31

1. Maximum SE of estimates is 1.8.
2. Results for LDF, RLH and R$\tau$H are from Randles et al. (1978).

(2) A Comparison with Koffler & Penfield (1978) and Conover &
Iman (1980)

Koffler and Penfield (1978) used four nonparametric density
estimation procedures, the nearest neighbor (NN), Parzen and
Cacoullos kernel (P-C), Loftsgaarden and Quesenberry (L-Q) and
Gessaman (GESS) estimators, in lognormal distributional
situations. Conover and Iman (1980) compared the rank
transformation method RLDF with these nonparametric procedures.
We compared these nonparametric procedures and the RLDF
procedure with the MLDF, RMLDF and Huber procedures. Bivariate
lognormal random variables were generated from independent
normals with unit variance and means $\mu$ and 0 for population
1, and means 0 and 0 for population 2, where $\mu = 1, 2, 3$. The
results appear in Table 4. For lognormal populations, RMLDF is
clearly to be preferred over LDF. The MLDF and RMLDF also
compare favorably with the nonparametric methods. But it seems
that the MLDF method does not work as effectively as the RLDF
method.

Table 4. Percentage Misclassified when n1 = n2 = 64
(lognormal situations)(a)

Procedure                   mu = 1    mu = 2    mu = 3
LDF                         34.1      26.6      22.5
NN                          31.8      22.7      12.4
P-C                         35.0      19.5       7.8
L-Q                         34.4      17.5       7.0
GESS                        30.9      17.5      12.4
RLDF                        32.5      15.4       6.3

LDF                         34.4      26.2      23.0
Huber (with rank cutoff)    33.5      18.5      11.9
MLDF                        33.7      21.1      16.7
RMLDF                       33.7      20.5      12.0

(a) Top part of table reproduces results from Koffler & Penfield
(1978) and results from Conover & Iman (1980). Bottom part of
table contains results of a new simulation.

Figure 1 displays plots of the estimated standard deviation of
the misclassification rate versus the average overall
misclassification rate for the three procedures, taken from
several simulation situations. If two procedures have the same
overall misclassification rate but one has less variability in
the misclassification rate, then the less variable procedure
would be preferred. The Huber-type M-estimate procedure and the
RMLDF procedure have less variability of the misclassification
rate than the LDF procedure.

Overall, the RMLDF method is simple and appears to perform well
relative to other nonparametric procedures.

Figure 1. Standard Deviation of the Misclassification Rate
vs. Average Misclassification Rate. Plotted procedures: LDF (□),
RMLDF (+) and Huber.

Robustness  of  Regression  M-Estimators  over  Complex-valued  Distributions 


Krishnendu  Ghosh 

Department  of  Mathematical  Sciences 
University  of  Montana 
Missoula,  MT  59812 


Richard  M.  Heiberger 
Department  of  Statistics 
Temple  University 
Philadelphia,  PA  19122 


Abstract 

Noisy complex-valued data, for which robust regression techniques
are the natural analysis approach, arise in many physical fields.
Evaluation of the efficiency of such techniques requires that
their behavior be charted over a series of known reference
distributions. We have defined several symmetric long-tailed
complex distributions (e.g., complex slash, complex Cauchy,
complex double exponential) based on the complex normal
distribution. We have compared via the maximin method the
robustness of different regression M-estimators (as defined by
their weight functions) over these distributions. The variances
of the estimators of the regression coefficients are obtained by
simulation over all the distributions and for all the weight
functions. The relative efficiencies over each distribution are
obtained and then these relative efficiencies are compared over
different distributions to identify the best weight function.
Three different sample sizes, 5, 11 and 15, have been used for
this purpose. We apply our estimators to the evaluation of the
Magnetotelluric response function.

KEY  WORDS:  Robustness;  Regression;  M-Estimators; 
Complex-Distributions. 

1.  Introduction 

Many  physical  settings  provide  data  for  which  linear 
regression  is  the  physically  appropriate  analysis  tech¬ 
nique.  In  one  such  technique,  the  Magnetotelluric 
method,  the  complex-valued  Fourier  transforms  of  the 
electric  and  magnetic  fields  measured  on  the  earth’s  sur¬ 
face  are  treated  as  the  response  and  explanatory  vari¬ 
ables  respectively.  Robust  techniques  are  needed  to  re¬ 
move  high  leverage  noise  contamination  in  the  electric 
field  attributable  to  electrical  activity  in  the  ionosphere. 

In this paper we use M-estimation, an iteratively reweighted
least squares technique where the weight matrix $w$ is a diagonal
matrix with real positive weights. The distribution of the
contaminating noise is not known; therefore the best function for
producing the weights from the observed data is not known. In
order to choose an appropriate weight function we must first
explore the behavior of several weight functions with a variety
of long-tailed complex symmetric distributions.

In  Section  2  we  briefly  review  the  univariate  complex 
normal.  We  then  define  several  related  univariate  sym¬ 
metric  complex  distributions. 

In Section 3 we discuss M-estimation of the regression
coefficients. We evaluate all the by now standard weight
functions (Huber, Cauchy, Welsch, Logistic, Fair, Hampel, tanh,
biweight, and Andrews), the Thomson weight function (Chave,
Thomson and Ander, 1987), and introduce a new function that we
call the Modified Thomson.

To  find  the  best  (robust)  weight  function  for  the  M- 
estimation  of  the  regression  coefficients,  we  use  in  Section 
4  the  concept  of  relative  optrim-efficiency.  We  compare 
the  performance  of  the  set  of  estimators  over  the  set  of 
long-tailed  complex  distributions  by  a  simulation  study. 

Our  recommended  procedure  for  the  M-estimation  of 
a  complex-valued  regression  coefficient  is  to  use  two  dif¬ 
ferent  sets  of  iteration,  each  based  on  a  different  weight 
function.  We  have  used  this  technique  to  improve  the 
estimation  of  Magnetotelluric  functions  in  a  companion 
paper  (Ghosh  and  Heiberger,  1991). 

2.  Symmetric  Complex  Distributions 

We are interested in those complex random variables
$Z = Z_R + iZ_I$ whose density functions $f_C(z)$ are real and
equal to the real-valued bivariate density $g_R(z_R, z_I)$, that
is

$$f_C(z) = g_R(z_R, z_I) \qquad (2.1)$$

Denote the real-valued marginal densities of the real and the
imaginary components of $Z$ by $h_R(z_R)$ and $h_I(z_I)$. We also
require that

$$h_R(u) = h_I(u) \qquad (2.2)$$

We list in Table 1 nine different symmetric complex distributions
ordered according to increasing tail weight (radius of the 93rd
percentile), ranging from the almost-familiar complex normal to
the heavy-tailed complex Cauchy with independent real and
imaginary components.

We give in Section 2.1 Goodman's (1963) definition of the complex
normal. We constructed the remaining distributions and derived
their density functions (Ghosh 1990). The derivations are
straightforward applications of the transformation of variables
method and are exceedingly tedious.

2.1  Complex  Normal,  CN 

Goodman defined a complex normal random variable as a complex
random variable whose real and imaginary parts are independent
bivariate normal. Let $Z$ follow the univariate complex normal,
to be denoted $Z \sim CN(0, 1)$. Both equations (2.1) and (2.2)
are satisfied. The p.d.f. of $Z$ is given by

$$f(z) = \frac{1}{\pi} \exp\{-(z_R^2 + z_I^2)\}, \qquad
-\infty < z_R, z_I < \infty \qquad (2.3)$$

Independence of the real and imaginary components in the
univariate complex normal has been assumed to allow easy
extension to the multivariate complex normal.

2.2  Complex  Cauchy  with  Independent 
Components,  CC(I) 

Let $X = (X_R + iX_I) \sim CN(0, 1)$ and
$Y = (Y_R + iY_I) \sim CN(0, 1)$, independently. Then

$$Z = \left(\frac{X_R}{Y_R}\right) + i\left(\frac{X_I}{Y_I}\right) \qquad (2.4)$$

is a Complex Cauchy with Independent Real and Imaginary
Components, CC(I).

2.3  Complex  Cauchy  with  Dependent 
Components,  CC(D) 

Let $X = (X_R + iX_I) \sim CN(0, 1)$ and $Y \sim N(0, 1/2)$
independent of each other. Then,

$$Z = \left(\frac{X_R}{Y}\right) + i\left(\frac{X_I}{Y}\right) \qquad (2.5)$$

is Complex Cauchy with Dependent Real and Imaginary Components,
CC(D). Note that independence of the real and imaginary
components is not required to satisfy conditions (2.1) and (2.2).

2.4  Complex  Slash  with  Independent 
Components,  CS(I) 

Let $X = (X_R + iX_I) \sim CN(0, 1)$ and $Y_1, Y_2$ both
$\sim U(0, 1)$, independent of each other and also independent of
$X$. Then,

$$Z = \left(\frac{X_R}{Y_1}\right) + i\left(\frac{X_I}{Y_2}\right) \qquad (2.6)$$

is Complex Slash with Independent Real and Imaginary Components,
CS(I).


2.5  Complex  Slash  with  Dependent 
Components,  CS(D) 

Let $X = (X_R + iX_I) \sim CN(0, 1)$ and $Y \sim U(0, 1)$
independent of each other. Then,

$$Z = \left(\frac{X_R}{Y}\right) + i\left(\frac{X_I}{Y}\right) \qquad (2.7)$$

is Complex Slash with Dependent Real and Imaginary Components,
CS(D).

2.6  Generalized  Complex  Slash,  GCS 

$Y = (Y_R + iY_I)$ is said to follow a univariate complex uniform
distribution CU in a unit disk if its probability density
function is given by $1/\pi$ for $|y| < 1$. Let
$X = (X_R + iX_I) \sim CN(0, 1)$ and
$Y = (Y_R + iY_I) \sim CU(\text{unit disk})$ independent of each
other. Then,

$$Z = \frac{X}{Y} \qquad (2.8)$$

has a Generalized Complex Slash, GCS, distribution.

2.7  Complex  t  Distribution,  CT 

Let $X = (X_R + iX_I) \sim CN(0, 1)$ and
$Y = (Y_R + iY_I) \sim CN(0, 1)$ independent of each other. Then,

$$Z = \frac{X}{Y} \qquad (2.9)$$

follows a Complex t distribution, CT, with 2 degrees of freedom.

Note that the familiar real variable definitions do not always
generalize to complex variables in the anticipated way. The
real-valued Cauchy distribution is defined as the ratio of two
independent standard normal variables. But the ratio of two
independent standard complex normal variables gives a complex-t
distribution with 2 degrees of freedom, not a complex Cauchy. The
complex Cauchy was given in Sections 2.2 and 2.3.
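The constructions in Sections 2.1 to 2.7 lend themselves to direct simulation. The sketch below is an illustration under assumptions, not the authors' code: it takes CN(0, 1) to have independent N(0, 1/2) real and imaginary parts, forms CC(I) and CC(D) as componentwise ratios and CT as the complex ratio X/Y, and uses hypothetical function names. The simulated median radii can be compared with the 50th quantile column of Table 1.

```python
import math
import random
import statistics

_HALF_SD = math.sqrt(0.5)


def cn(rng):
    """Standard complex normal: independent N(0, 1/2) real and imaginary parts."""
    return complex(rng.gauss(0.0, _HALF_SD), rng.gauss(0.0, _HALF_SD))


def cc_i(rng):
    """Complex Cauchy, independent components: ratios taken part by part."""
    x, y = cn(rng), cn(rng)
    return complex(x.real / y.real, x.imag / y.imag)


def cc_d(rng):
    """Complex Cauchy, dependent components: one shared N(0, 1/2) divisor."""
    x = cn(rng)
    y = rng.gauss(0.0, _HALF_SD)
    return complex(x.real / y, x.imag / y)


def ct(rng):
    """Complex t with 2 degrees of freedom: ratio of two complex normals."""
    return cn(rng) / cn(rng)


def median_radius(sampler, n, seed):
    """Median of |Z| over n simulated draws from `sampler`."""
    rng = random.Random(seed)
    return statistics.median(abs(sampler(rng)) for _ in range(n))


# Simulated 50th quantile radii, to be set against Table 1.
radii = {name: median_radius(f, 20000, seed)
         for seed, (name, f) in enumerate(
             [("CN", cn), ("CT", ct), ("CC(D)", cc_d), ("CC(I)", cc_i)])}
```

The difference between `ct` and `cc_d` mirrors the remark above: dividing by a complex normal yields the complex t, while dividing both components by one real normal yields the complex Cauchy with dependent components.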

2.8  Complex  Double  Exponential  Distribution, 

CDE 

Let $X_j = (X_{jR} + iX_{jI}) \sim CN(0, 1)$, $j = 1, 2, 3, 4$,
independent of each other. Then,

$$Z = (X_{1R}X_{2R} + X_{3R}X_{4R}) + i(X_{1I}X_{2I} + X_{3I}X_{4I}) \qquad (2.10)$$

has a Complex Double Exponential distribution, CDE.

2.9  Complex  Logistic  Distribution,  CL 

The CL distribution is defined so that the joint distribution of
the real and imaginary parts follows a bivariate logistic
distribution and each of the real and imaginary components
follows a real univariate logistic distribution.

3.  Regression  M-Estimators  and  Weight 
Functions 

The M-estimate, or maximum likelihood type estimate, $T_N$ based
on a sample $(x_1, x_2, \ldots, x_N)$ of size $N$ is the value of
$t$ that minimizes the objective function
$\sum_{j=1}^{N} \rho(x_j - t)$. The loss function $\rho$ is
assumed to be continuous, and has derivatives with respect to $t$
at all values of $t$. We calculate $T_N$ by finding the value of
$t$ that satisfies the equation $\sum_{j=1}^{N} \psi(x_j - t) = 0$
where

$$\psi(u) = \frac{d}{du}\rho(u).$$

Let us consider the linear regression model

$$\underset{N \times 1}{y} = \underset{N \times q}{X}\;\underset{q \times 1}{\beta} + \underset{N \times 1}{r} \qquad (3.1)$$

The M-estimation process minimizes a norm of residuals, as does
the least squares process. But the misfit measure in
M-estimation is chosen so that a few extreme values cannot
dominate the answer. The M-estimate is obtained by minimizing

$$\sum_{j=1}^{N} \rho\left(\frac{r_j}{d}\right) = R^{H}R \qquad (3.2)$$

where minimization is done with respect to $\beta$, $R$ is an
$N \times 1$ vector whose jth element is $\sqrt{\rho(r_j/d)}$,
$r_j = y_j - X_j\beta$ is the jth residual, and $d$ is a scale
factor. In the special case of a quadratic loss,
$\rho(u) \propto |u|^2$, and $d = 1$, M-estimation specializes to
least squares estimation.

Equation (3.2) yields solutions of the non-linear system

$$X^{H}\Psi = 0 \qquad (3.3)$$


Table 1. Interquartile diameter $\sigma_{IQ}$ and the 50th, 80th,
91st and 93rd quantile radii for complex distributions.

Distribution    sigma_IQ    50th    80th    91st     93rd
CN              1.66        0.83    1.27     1.55     1.64
GCS             2.52        1.26    2.16     2.35     2.37
CS(D)           3.50        1.75    2.47     2.53     2.54
CDE             2.70        1.35    2.43     3.27     3.59
CT              2.00        1.00    2.00     3.16     3.74
CL              3.72        1.86    3.11     4.08     4.46
CC(D)           3.46        1.73    4.90    10.95    14.97
CS(I)           4.23        2.11    5.60    12.39    16.91
CC(I)           4.40        2.20    6.23    13.97    19.31

where $\Psi$ is an $N \times 1$ vector whose jth element is the
influence function $\psi(r_j/d)$ (the superscript $H$ denotes the
conjugate transpose). We solve equation (3.3) by expressing it as
a weighted least squares problem

$$X^{H} w r = 0 \qquad (3.4)$$

where $r$ is the $N \times 1$ residual vector and $w$ is the
$N \times N$ diagonal matrix of weights whose jth diagonal
element is $w_j = \psi(r_j/d)/(r_j/d)$. The solution to equation
(3.4) is given by iteratively solving

$$\hat{\beta} = (X^{H} w X)^{-1}(X^{H} w y) \qquad (3.5)$$

The weights at each iteration are computed from the residuals and
scale estimate of the previous iteration.
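The iteration can be sketched for the simplest case of a single complex regression coefficient, where the weighted least squares step reduces to a ratio of two sums. This is a minimal sketch under assumptions rather than the authors' procedure: the Huber weight with tuning constant 1.345 is a conventional choice not taken from the paper, the scale is taken as the sample interquartile diameter (read here as twice the median residual modulus) divided by the complex normal value 1.66 from Table 1, and the names `huber_weight` and `irls_complex` are hypothetical.

```python
import statistics


def huber_weight(u, k=1.345):
    """Huber weight: 1 inside the threshold k, k/|u| beyond it."""
    a = abs(u)
    return 1.0 if a <= k else k / a


def irls_complex(x, y, sigma_iq=1.66, weight=huber_weight, iters=30):
    """Iteratively reweighted least squares for y_j = x_j * beta + r_j.

    x and y are sequences of complex numbers and beta is one complex
    coefficient. Weights are real and depend on |r_j| / d only.
    """
    w = [1.0] * len(x)  # first pass is ordinary least squares
    beta = 0j
    for _ in range(iters):
        # Scalar form of beta = (X^H w X)^{-1} (X^H w y).
        num = sum(wj * xj.conjugate() * yj for wj, xj, yj in zip(w, x, y))
        den = sum(wj * abs(xj) ** 2 for wj, xj in zip(w, x))
        beta = num / den
        resid = [yj - xj * beta for xj, yj in zip(x, y)]
        # Scale: sample interquartile diameter over the population value.
        d = 2.0 * statistics.median(abs(r) for r in resid) / sigma_iq
        if d == 0.0:
            break  # at least half the residuals are exactly zero
        w = [weight(abs(r) / d) for r in resid]
    return beta
```

With several regressors the same loop applies, with the two sums replaced by the matrix products of equation (3.5).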

A practical choice of the scale factor is
$d = s_{IQ}/\sigma_{IQ}$, where $s_{IQ}$ is the sample
interquartile diameter of the complex residuals and
$\sigma_{IQ}$ is the population interquartile diameter of the
underlying distribution of $r$.

3.1  Modified  Thomson  Weight  Function  (M- 
Thomson) 

Thomson's weight function is different from the others listed in
Section 1 because it is data adaptive. The quantity $a$ in
Thomson's weight function is the nth quantile of the assumed
underlying distribution. The point at which the downweighting
begins depends on both the underlying distribution and the sample
size $n$.

We found that Thomson's weight function is robust to several
underlying distributions but does not work well enough for very
heavy-tailed distributions. We therefore propose a new weight
function, a modification of the Thomson weight function:

«;(u)  =  =  exp  J(ln  a)(— (3.1) 

Table 1 displays the $a$ values for the 80th, 91st and 93rd
quantiles of the complex distributions (corresponding to
$n = 5, 11, 15$).

The advantage of the M-Thomson weight function over Thomson's
function comes from the change in the base of the exponential as
$a$ changes. It downweights the potential outliers as the sample
size increases to a greater extent than do Thomson's weights. For
small-tailed distributions like CN, GCS and CS(D), the M-Thomson
function puts more weight on the valid data and also protects
non-outliers from too much downweighting. For mid-size
distributions like CDE, CT and CL, the M-Thomson weight function
rapidly downweights data points that are beyond the nth quantile.
For large-tailed distributions like CC(D), CS(I) and CC(I), the
M-Thomson and Thomson weight functions give almost
identical  results. 

4.  A Simulation Study

We have evaluated the weight functions listed in Section 1
over the distributions defined in Section 2.

In order to find the best M-estimator (weight function)
of the regression coefficients for complex-valued
data we compare the different weight functions using the
maximin approach. Our presentation is based on Chapters
10 and 11 of Hoaglin, Mosteller and Tukey (1983).
We have s different weight functions
$(w_1(u), w_2(u), \ldots, w_s(u))$ and therefore s estimators of $\beta$.

We want to investigate which estimator among
these is the most efficient over a wide range of distributions.
The optrim-efficiency for a specified estimator
is the ratio of the variance of the best estimator (the
optrim) for a given distribution to the variance of the
specified estimator. With weighted least squares estimators
$\hat\beta_w$, the optrim-efficiency is

optrim-Eff  (4.1) 

where the notation $W_k$ means the diagonal matrix whose
jth diagonal element is $w_k(|r_j/d|)$.

4.1  Procedure 

We did a maximin analysis of the optrim-efficiencies
for three different sample sizes (5, 11 and 15) and nine
different symmetric complex distributions, 27 different
sample size-distribution combinations in all. The procedure
we follow in calculating optrim-efficiencies for each
sample size is: (1) determine by simulation the estimator
variance for each weight function and distribution, (2)
calculate the minimum variance over weight functions
for each distribution, (3) calculate the optrim-efficiencies
for each combination of weight function and distribution,
(4) find the minimum efficiency over the distributions for
each of the weight functions, and (5) find the maximum
over the estimators of the minimum efficiencies.
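
Steps (2) through (5) of this procedure reduce to a few array operations once the simulated variances of step (1) are in hand. The sketch below assumes those variances are stored in a table `var[k, m]` for weight function k and distribution m; the function name and the numbers in the usage note are purely illustrative and are not results from the study.

```python
import numpy as np

def maximin_choice(var):
    """Maximin selection from a table of simulated estimator variances.

    var[k, m] is the variance of estimator k under distribution m.
    Returns the index of the maximin estimator, each estimator's
    worst-case optrim-efficiency, and the full efficiency table.
    """
    var = np.asarray(var, dtype=float)
    best = var.min(axis=0)          # step 2: optrim variance per distribution
    eff = best / var                # step 3: optrim-efficiencies
    worst = eff.min(axis=1)         # step 4: minimum efficiency per estimator
    k = int(np.argmax(worst))       # step 5: maximum of the minima
    return k, worst, eff
```

For a hypothetical table with rows `[1.0, 4.0, 9.0]`, `[1.2, 2.0, 3.0]` and `[2.0, 2.5, 2.0]`, the second estimator is selected: it is never the best under the first distribution, but its worst-case efficiency is the highest of the three.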

Simulation of numbers from the various complex distributions
is straightforward since real uniform and real normal
generators are available in all software libraries.

4.2  Observations 

We find difficulties with most of the standard weight
functions in most long-tailed complex situations. In
particular, Tukey's biweight and Andrews' wave functions
fail for long-tailed distributions. The redescending
Huber and Hampel weight functions behave differently
from the rest. They may assign full weight to potential
outliers. Thomson's weight function is robust to several
underlying distributions but does not work quite
well enough for long-tailed distributions. The Modified
Thomson function is often the best, dominating all the
others except with the very long-tailed complex Cauchy
distributions, where it gives results similar to the Thomson
function.

4.3  Recommendations 

In order to estimate a complex-valued regression coefficient
using the M-estimation technique, we use two
different sets of iterations. In the first set of iterations
we choose a weight function, usually the redescending
Huber or Hampel weight function so as not to reject too
many outlier points too early, and iterate until the residual
norm $|r^{H} r|$ does not change appreciably. In the second
set of iterations we choose another weight function
dependent on sample size and iterate similarly until
we get the desired convergence. For most sample sizes
we looked at, the M-Thomson weight function seems to
dominate. Other good choices for the second set are the
Thomson, logistic, or hyperbolic tangent functions.

Abstract: Minimum total error, or $L^1$, regression estimates
are a generalization of the sample median to prediction problems.
Multivariate extensions therefore involve the concept of
a multivariate median. There are many inequivalent characterizations
of a multivariate median in the literature, all of which
seem to have at least one of two major difficulties: either they
lack the property of affine covariance which we have come to
expect from ordinary multivariate regression, or they are
computationally highly unpleasant. We here propose a definition
of multivariate median, inspired by the theory of M-estimation,
that transforms appropriately under linear changes
of variables. Furthermore, it may be computed straightforwardly
using a fixed-point property. The result is a resistant
multivariate regression estimate that is intuitively appealing
and, surprisingly, increasingly efficient at the normal model in
higher dimensions. We share some computational experience
with this estimator.

I.  Introduction 

Linear regression is perhaps the central tool of modern
statistics; it seeks predictive models of the form $y = Xb$.
The classical criterion for fitting this model goes back at
least to Legendre, the method of least squares:

$$\min_b \sum_{i=1}^{n} \bigl(y_i - (Xb)_i\bigr)^2$$

The simplicity and power of this procedure are unexcelled.
However, in modern times, statisticians have become increasingly
concerned with the lack of robustness of the least-squares
technique, that is, its sensitivity to a few observations for which
the model fit is very poor. Perhaps the oldest technique for dealing
with this problem (predating even least squares; see, e.g., Stigler
[1986]) is the minimum total error, or $L^1$, criterion:

$$\min_b \sum_{i=1}^{n} \bigl|y_i - (Xb)_i\bigr|$$

The naturalness of this method is to some extent offset by its
greater computational difficulty and by its relatively low
efficiency at the normal model. However, it is robust: there is
an upper bound on how much influence any poorly-fitting
observation can have on predictions. Thus, the minimum total
error criterion has attracted considerable recent attention.

Extension of the linear model to several dependent variables,
multivariate regression, turns out to be straightforward
using least squares. However, extension of the least total error
criterion to several dimensions turns out to be more problematic.
Consider the simplest case of regression, the location
problem $\min_\mu \sum_{i=1}^{n} |y_i - \mu|$. It is standard that the
solution $\mu$ is the sample median. Similarly, the solution of the
least squares location problem is the sample mean. When we proceed to
several variables, it is still true in every sense that the multivariate
sample mean, with the obvious definition, is the solution
of the least squares problem. However, even the appropriate
definition of a multivariate median is problematic, so that it is
not obvious what is meant by multivariate $L^1$ regression.

A number of possible definitions of multivariate median
are discussed in Small [1990]. Perhaps the simplest is the
vector of medians of each coordinate by itself. This corresponds
to solving the least total error problem for each dependent
variable separately. For some applications this may be
reasonable, but it has one obvious major flaw: if we take a
rotation of the dependent variables, it is not generally true that
the median of the rotated data is the rotation of the median.
Another simple definition of the multivariate median is obtained
by extending minimum total error to a minimum total
distance criterion:

$$\min_\mu \sum_{i=1}^{n} \|y_i - \mu\|$$

This approach, called the $L^1$ median, which dates back at
least to Weber [1909], amounts to choosing a point in space so
that the stars are scattered as uniformly as possible over the
celestial sphere. It is obviously unaffected by rotations. Unfortunately,
this idea for a median fails to transform nicely if we
rescale one of the coordinates differently from the others. For
example, if one variable is in inches and the other in dollars,
changing the scale on the first axis to centimeters will change
the median in a nonobvious way. Since in statistical practice
our coordinates are often inhomogeneous, the applicability of
this definition is too limited.
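
Both failures of covariance described above are easy to check numerically. The sketch below uses three hypothetical points, a 45 degree rotation, and an inches-to-centimeters rescaling of the first axis. The minimum total distance median is computed with Weiszfeld's iteration, a standard method that is an assumption here and is not discussed in the text; the function names are likewise illustrative.

```python
import numpy as np

def coordinatewise_median(x):
    """Vector of medians of each coordinate taken separately."""
    return np.median(x, axis=0)

def spatial_median(x, n_iter=1000, eps=1e-12):
    """Minimum total distance (L1) median via Weiszfeld's iteration."""
    m = x.mean(axis=0)
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(x - m, axis=1), eps)
        w = 1.0 / dist                       # inverse-distance weights
        m = (w[:, None] * x).sum(axis=0) / w.sum()
    return m

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
c = np.cos(np.pi / 4)
R = np.array([[c, -c], [c, c]])              # rotation by 45 degrees
S = np.diag([2.54, 1.0])                     # inches -> centimeters on axis 1

# Coordinatewise median: rotating the data does not rotate the median.
print(coordinatewise_median(pts @ R.T), R @ coordinatewise_median(pts))

# L1 median: covariant under rotation, but not under axis rescaling.
print(spatial_median(pts @ R.T), R @ spatial_median(pts))
print(spatial_median(pts @ S.T), S @ spatial_median(pts))
```

The first and third printed pairs disagree, while the second pair agrees, which is the behavior the two paragraphs above describe.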

One of the great virtues of the median is its covariance
under any monotone transformation. It is not clear that this
desideratum is achievable for any multivariate location measure.
However, it is certainly desirable, as our two examples
suggest, to have a multivariate median covariant under as rich
as possible a set of transformations of the data. For example,
the mean is affine, that is, $E(a + Bx) = a + B\,E(x)$. Arbitrary
linear changes of variables adjust the mean in the obvious way.
We shall therefore restrict our attention to definitions of
multivariate median that are affine; a number of these are
discussed in Small's survey. However, all of these concepts
have at least one of two serious drawbacks. Either they are
rather difficult to compute, or they do not generalize in any
obvious way to a definition of multivariate median for distributions.

We shall propose an affine location criterion, the m-median,
inspired by Huber's M-estimates, that is plausibly a
multivariate generalization of the ordinary median. It will have
an obvious characterization on distributions, and we will
propose a reasonably efficient method for computing it. It has
the nice property that it becomes more nearly efficient at the
normal model as the dimensionality increases. Extension to a
general tool for robust multivariate regression will be
straightforward.

II.  Multivariate  Location  Estimation 

Least squares and minimum total error are each special
cases of the $L^p$ location estimate, which solves

$$\min_\mu \sum_{i=1}^{n} |y_i - \mu|^p .$$

It is this which we shall generalize to several variables.
Following the lead of Gauss, we recognize that multivariate
least squares arises as the maximum likelihood estimate of the
parameters in the multivariate normal family of densities

$$f(x) = (2\pi)^{-d/2} (\det V)^{-1/2}
  \exp\left\{ -\tfrac{1}{2} (x-\mu)^T V^{-1} (x-\mu) \right\}.$$

The family is affine, with the transformation rule for the mean
as above and for the covariance matrix $V' = B V B^T$: the
location estimate is the multivariate mean and the scale estimate
is the sample covariance matrix. This suggests a natural
way to achieve an affine $L^p$ location statistic; let it be the
maximum likelihood estimate of the parameters in the family

$$f(x) = c\,(\det V)^{-1/2}
  \exp\left\{ -b \left[ (x-\mu)^T V^{-1} (x-\mu) \right]^{p/2} \right\},$$

where b will be chosen later (it is essentially arbitrary, but we
need a consistent choice), and c is the constant that lets the
family integrate to one (we will never need to compute it). The
maximum likelihood criterion for estimating this from an i.i.d.
sample of n random vectors is

$$\min_{\mu, V}\; \frac{n}{2} \log\det V
  + b \sum_{i=1}^{n} \left[ (x_i-\mu)^T V^{-1} (x_i-\mu) \right]^{p/2}.$$

The solution $\mu$ to this problem will constitute our definition of
an affine $L^p$ location statistic. The following partial result is
immediate: for a fixed nonsingular V a solution for $\mu$ exists and
the collection of solutions is convex. If the observations do not
lie in a hyperplane, then for fixed $\mu$ a solution V exists. Any
joint solution is affine: the transformed solution is a solution
for the transformed sample.

Notice that an extension of this definition to one for
distributions is immediate: simply replace the sums with
expectations. If the random vector possesses elliptical symmetry,
then $\mu$, if it exists, is the center of symmetry. V, if it exists,
is a multiple of the quadratic form that characterizes the
elliptical symmetry.

We are left to decide on an appropriate value for b. From
the definition, it is clear that this decision has no effect on the
definition of $\mu$. However, a definite solution for V will be useful
in various inferences about our model, and b scales V. If our
goal were primarily robustification of the normal model, we
would choose the constant so that V coincided with the covariance
of a multivariate normal random variable. For the
general problem of an appropriate definition of $L^p$ scale, the
normal family plays no special role. Therefore, I propose the
following criterion for deciding b: For a distribution uniform
on the unit sphere, let V be identical to the ordinary covariance
matrix; that is, $V = I/d$ in dimension d, where I is the identity
matrix. Then our scale behaves predictably on the simplest affine
family, one out of which all others may readily be built. A
calculation gets

$$b = d^{\,1 - p/2} / p .$$

III.  The M-median

The case p=1 of a multivariate $L^p$ location estimate will
give us our desired affine multivariate generalization of the
median.

Definition: The m-median is any vector $\mu$ which solves

$$\min_{\mu, V}\; \frac{n}{2} \log\det V
  + \sqrt{d} \sum_{i=1}^{n} \left[ (x_i-\mu)^T V^{-1} (x_i-\mu) \right]^{1/2}.$$

The definition for distributions is analogous. The m-median is
affine, but coincides with the $L^1$ median for spherically symmetric
data. In particular it is the ordinary median in the case
of one variable. Notice that it is not defined if the observations
all fall in a hyperplane. In that case, use the definition that
applies to the smallest-dimensional hyperplane that contains
all observations.

One fact is immediate: given d+1 noncohyperplanar vectors,
an m-median is at the barycenter of the simplex they form.
For we may transform them to the corners of a regular simplex,
where the result follows from symmetry, then transform back.

Milasevic and Ducharme [1987] have shown that the $L^1$
median is unique for noncolinear data. But then the m-median
is also unique in this case, as we may transform to the case
where V is a multiple of the identity and so the two definitions
coincide.

Proposition: The relative efficiency of the m-median to
the mean in the multivariate normal case, in dimension d, is

$$\frac{2}{d} \left[ \frac{\Gamma\bigl((d+1)/2\bigr)}{\Gamma(d/2)} \right]^2 .$$

The proof involves computing the expected square of the
infinitesimal influence function of the statistic after transforming
to the spherically symmetric case. This generalizes a result
of Brown [1983] for the $L^1$ median. Here are some special
cases:


Dimension    Efficiency
1            0.6366
2            0.7854
3            0.8488
4            0.8836
∞            1.0000

Thus, the m-median becomes more nearly efficient in
higher dimensions, and because of its robustness (since the
influence function of a point is bounded) is a worthy competitor
to classical measures of location.
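
The tabulated efficiencies can be reproduced from a closed form. The expression used below, (2/d) [Γ((d+1)/2) / Γ(d/2)]^2, is the known asymptotic efficiency of the minimum total distance median at the spherical normal; that the paper states its proposition in this exact form is an assumption. The function name is hypothetical.

```python
from math import gamma

def m_median_efficiency(d):
    """Relative efficiency to the mean at the d-variate normal model
    (assumed closed form; see the lead-in)."""
    return (2.0 / d) * (gamma((d + 1) / 2.0) / gamma(d / 2.0)) ** 2

for d in (1, 2, 3, 4):
    print(d, round(m_median_efficiency(d), 4))
```

For d = 1 the expression reduces to 2/π, the familiar efficiency of the univariate median, and for d = 2 to π/4.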

One interesting phenomenon should be noted: since the
m-median is covariant under arbitrary linear changes of coordinates,
it is afflicted by a sort of nonrobustness in certain
cases. If all but a few observations lie in a hyperplane, points
off that hyperplane may be arbitrarily influential on the coordinates
of the m-median in the directions orthogonal to the
hyperplane. This seems unavoidable for nontrivial affine
statistics.


IV.  Computing  the  Estimates 

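In the case p=2, we have closed form estimates for the
multivariate mean and the covariance matrix. For computing
the general affine $L^p$ location statistic, we need

Theorem: A fixed point for the affine $L^p$ location fitting
criterion is given by

$$\mu = \frac{\displaystyle\sum_{i=1}^{n}
          \frac{x_i}{\left[(x_i-\mu)^T V^{-1} (x_i-\mu)\right]^{(2-p)/2}}}
        {\displaystyle\sum_{i=1}^{n}
          \frac{1}{\left[(x_i-\mu)^T V^{-1} (x_i-\mu)\right]^{(2-p)/2}}}$$

$$V = \frac{bp}{n} \sum_{i=1}^{n}
      \frac{(x_i-\mu)(x_i-\mu)^T}
           {\left[(x_i-\mu)^T V^{-1} (x_i-\mu)\right]^{(2-p)/2}}$$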

These were derived by variation of the parameters. Our
algorithm is to obtain a starting estimate for $\mu$ and V (for
example, the mean and covariance), then iterate until the
estimates do not change significantly. In a large number of
trials this converged in all cases where $1 \le p \le 2$. Our
algorithm for computing the m-median is then just the case p=1.
Surprisingly, this procedure was successful even in the case of
the univariate median, though it converged very slowly and is
no competition for the usual median algorithms. In higher
dimensions, it usually converged moderately rapidly, getting
6 significant figures in perhaps 20 iterations. The exceptions to
this were usually cases in which $\mu$ coincided with a data point;
then convergence was very slow. Presumably the algorithm
could be modified to recognize this special case. Except in one
dimension, it seems to be very unusual for the location to
coincide with a data point.
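
The fixed-point algorithm just described can be sketched as follows. This is an illustration under stated assumptions rather than the author's implementation: it uses the normalization b = d^(1 - p/2) / p (which gives √d for p = 1 and 1/2 for p = 2), updates the location and scale simultaneously from the previous iterate, and guards against a zero quadratic form with a small floor `eps`. The function name, stopping rule and tolerances are hypothetical.

```python
import numpy as np

def affine_lp_location(x, p=1.0, max_iter=500, tol=1e-10, eps=1e-12):
    """Fixed-point iteration for the affine L^p location and scale.

    x is an (n, d) array of observations; p = 1 gives the m-median.
    Returns (mu, V).
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    b = d ** (1.0 - p / 2.0) / p                 # assumed normalization
    mu = x.mean(axis=0)                          # starting estimates:
    V = np.cov(x, rowvar=False, bias=True).reshape(d, d)  # mean, covariance
    for _ in range(max_iter):
        r = x - mu
        Vinv = np.linalg.inv(V)
        q = np.einsum("ij,jk,ik->i", r, Vinv, r)  # quadratic forms
        w = np.maximum(q, eps) ** (-(2.0 - p) / 2.0)
        mu_new = (w[:, None] * x).sum(axis=0) / w.sum()
        V_new = (b * p / n) * (w[:, None] * r).T @ r
        step = mu_new - mu
        done = (step @ Vinv @ step <= tol ** 2 and
                np.linalg.norm(Vinv @ V_new - np.eye(d)) <= tol)
        mu, V = mu_new, V_new
        if done:
            break
    return mu, V
```

With p = 2 the weights are all one, so the iteration returns the sample mean and the (divide-by-n) sample covariance after a single pass. Every quantity in the loop transforms appropriately under a nonsingular affine change of variables, so the output inherits the affine property claimed for the estimate.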

Scott et al. [1978] report the serum cholesterol and triglyceride
levels for 320 males who reported chest pain. The sample
mean was cholesterol 216.19 and triglycerides 179.35, with a
correlation of .228. After fewer than 20 iterations we found the
m-median was cholesterol 212.71 and triglycerides 156.37
with a "correlation" of .240. A few very high triglyceride
levels apparently distorted the typical value, and even diluted
the correlation slightly. The figure shows a sparse histogram of
this data set (with several extreme cases unfortunately censored).
The digit 2 indicates the mean and 1 the m-median.

The extension to $L^p$ multivariate regression is straightforward:
replace $\mu$ by a linear model with one or several independent
variables, and V is then a sort of covariance matrix of the
multiple residuals. The first fixed point equation becomes a
system of weighted normal equations; our method is thus a
special case of iteratively reweighted least squares. Tests and
confidence statements for such a method raise a number of
interesting questions, which will be dealt with in a later paper.

Figure: Serum Lipids of 320 Men with Chest Pain

Abstract

We are validating the global cloud parameters derived from the
satellite-borne HIRS2 and MSU atmospheric sounding instrument
measurements, and are using the analysis of these data as one
prototype for studying large geophysical data sets in general. The
HIRS2/MSU data set contains a total of 40 physical parameters,
filling 25 MB/day; raw HIRS2/MSU data are available for a period
exceeding 10 years. Validation involves developing a quantitative
sense for the physical meaning of the derived parameters over the
range of environmental conditions sampled. This is accomplished by
comparing the spatial and temporal distributions of the derived
quantities with similar measurements made using other techniques,
and with model results.

The data handling needed for this work is possible only with the
help of a suite of interactive graphical and numerical analysis
tools. Level 3 (gridded) data is the common form in which large data
sets of this type are distributed for scientific analysis. We find
that Level 3 data is inadequate for the data comparisons required
for validation. Level 2 data (individual measurements in geophysical
units) is needed. A sampling problem arises when individual
measurements, which are not uniformly distributed in space or time,
are used for the comparisons. Standard 'interpolation' methods
involve fitting the measurements for each data set to surfaces,
which are then compared. We are experimenting with formal criteria
for selecting geographical regions, based upon the spatial frequency
and variability of measurements, that allow us to quantify the
uncertainty due to sampling. As part of this project, we are also
dealing with ways to keep track of constraints placed on the output
by assumptions made in the computer code. The need to work with
Level 2 data introduces a number of other data handling issues, such
as accessing data files across machine types, meeting large data
storage requirements, accessing other validated data sets,
processing speed and throughput for interactive graphical work, and
problems relating to graphical interfaces.

KEY WORDS: large data sets, validation, satellite data analysis

1.  Introduction

NASA's Earth Observing System (EOS) will generate vast quantities of
data. Hundreds of terabytes of data will be acquired from orbit to
characterize the Earth's environment with the kind of spatial and
temporal detail needed to study climate change. Such high resolution
is required to properly sample the non-linear impact of small-scale
phenomena, which can make significant contributions to the
global-scale budgets of heat and momentum. It is also expected that
the data will be analyzed not just in the traditional manner,
concentrating on a single data set at a time, but in new ways that
involve routinely comparing data sets from multiple sources. Part of
the need to study multiple data sets comes from a growing
appreciation for the importance to global conditions of transports
across boundaries such as the air-ocean interface (e.g., Earth
System Science Committee, 1988).

We are undertaking the validation of cloud parameters derived from
the High Resolution Infrared Radiation Sounder 2 (HIRS2) and the
Microwave Sounding Unit (MSU) instruments aboard the NOAA polar
orbiting meteorological satellites. The instruments provide one of
the few global measures of cloud properties extending over many
years. They are also capable of obtaining near-simultaneous
constraints on the physical characteristics of the atmosphere and
surface needed to derive cloud properties. One goal of this work is
to learn about analyzing large geophysical data sets in general.

Radiances from the HIRS2 and MSU instruments have been analyzed by
Susskind and co-workers using an algorithm that accounts
self-consistently for the first-order physical quantities affecting
the emergent radiation (Susskind et al., 1984; 1987). The standard
data products are (1) monthly mean values for forty meteorological
parameters, including effective cloud amount and effective cloud top
height, on a grid of boxes 2 degrees in latitude by 2.5 degrees in
longitude, and (2) 'daily data' with twice-daily temporal sampling,
a spatial resolution of about 125 km, and spacing between points of
about 250 km. The monthly mean data are referred to as a 'Level 3'
(gridded) product, and the daily data is called a 'Level 2' product
(individual measurements reduced to geophysical units) (Space
Science Board, 1982; EOS Data Panel, 1986). The size of the
uncompressed Level 3 data is about 4 MB/month, whereas the Level 2
product fills about 25 MB/day (750 MB/month).

By  validation  we  mean  'developing  a  quantitative  sense  for 
the  physical  meaning  of  the  measured  parameters,'  for  the 
range  of  conditions  under  which  they  are  acquired.  Our 
approach  involves:  (1)  identifying  the  assumptions  made  in 
deriving  parameters  from  the  measured  radiances,  (2)  testing 
the  input  data  and  derived  parameters  for  statistical  error, 
sensitivity,  and  internal  consistency,  and  (3)  comparing  with 
similar  parameters  obtained  from  other  sources  using  other 
techniques.  A  study  of  this  type  was  performed  for  sea 
surface  temperature  (Njoku,  1985),  and  our  project  is  one  of 
several  parallel  efforts  currently  underway  to  validate 
different  cloud  climatologies  (e.g.,  Rossow  et  al.,  1985; 
1990).  The  validation  effort  we  are  undertaking  introduces  a 
number  of  problems  that  may  be  of  interest  to  specialists  in 
computational  statistics,  such  as  the  INTERFACE 
community,  as  well  as  to  those  involved  in  research  directly 
related  to  interpreting  large  geophysical  data  sets.  This 
article  summarizes  the  key  data  handling  issues  we  have 
encountered. 


2.  The  Need  for  'Level  2'  Data 

Large  geophysical  data  sets,  such  as  cloud  climatologies,  are 
often  distributed  to  researchers  in  gridded  (Level  3)  form. 
This  can  reduce  the  data  volume  by  orders  of  magnitude 
relative  to  the  parameter  values  for  each  individual  sounding 
(Level  2),  and  provides  the  user  with  a  'spatially  uniform' 
data product. For example, Figure 1A is the global,
monthly-mean  cloud  amount  map  for  July  1979  from  the 
HIRS2/MSU  data,  in  the  original  2  degree  by  2.5  degree 
averaging  bins.  All  accepted  cloud  amount  data  from  the 
individual  atmospheric  soundings  that  fell  within  each 
geographic  box  were  summed,  and  mean  and  variance  values 
for  each  box  were  calculated. 
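
The gridding step described above, in which individual soundings are accumulated into 2 degree by 2.5 degree boxes and reduced to a mean and variance per box, can be sketched as follows. This is only an illustration: the function name and array layout are hypothetical, the variance is the simple divide-by-count form, and quality screening of "accepted" data is omitted.

```python
import numpy as np

def bin_level2(lat, lon, value, dlat=2.0, dlon=2.5):
    """Grid individual measurements into dlat x dlon degree boxes.

    Returns per-box mean, variance and count arrays of shape
    (180/dlat, 360/dlon); boxes with no data hold NaN in mean and variance.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    value = np.asarray(value, dtype=float)
    nlat = int(round(180.0 / dlat))
    nlon = int(round(360.0 / dlon))
    i = np.clip(np.floor((lat + 90.0) / dlat).astype(int), 0, nlat - 1)
    j = np.floor((lon % 360.0) / dlon).astype(int) % nlon
    flat = i * nlon + j                           # one index per box
    size = nlat * nlon
    count = np.bincount(flat, minlength=size).astype(float)
    s1 = np.bincount(flat, weights=value, minlength=size)
    s2 = np.bincount(flat, weights=value ** 2, minlength=size)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = s1 / count
        var = s2 / count - mean ** 2
    shape = (nlat, nlon)
    return mean.reshape(shape), var.reshape(shape), count.reshape(shape)
```

Note that the per-box variance produced this way mixes measurement noise with any real variation of the field inside the box, since only the pooled second moment is retained.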

Several  problems  occur  when  using  Level  3  products  for 
validation.  First,  if  only  the  Level  3  parameter  values  and 
associated  variances  are  available,  there  is  no  way  to  assess 
how much of the reported variance is due to inherent
non-uniformity of the parameter over the averaging region.
Essentially, the instrument resolution is degraded to a scale
comparable  to  the  box  size,  and  information  originally 
acquired  to  measure  smaller-scale  phenomena  in  both  the 
spatial  and  temporal  domains  is  lost.  For  example,  in  a  2 
by  2.5  degree  box,  the  surface  temperature  may  exhibit 
random  fluctuations  of  half  a  degree  and  may  change 
systematically  by  several  degrees,  whereas  the  box  average 
variance  will  assign  all  the  variability  to  random  error. 


We  encountered  a  second  problem  when  making 
comparisons  among  Level  3  products  with  different  gridding 
schemes.  The  best  concurrent  cloud  climatology  available 
for comparison with the data in Figure 1A was derived from
the  Temperature  Humidity  Infrared  Radiometer/Total  Ozone 
Mapping  Spectrometer  (THIR/TOMS)  on  the  NASA 
Nimbus  7  satellite  (Stowe  et  al.,  1988;  1989).  The  standard 
THIR/TOMS  Level  3  data  product  was  binned  according  to  a 
global  500  by  500  km  grid  that  is  also  used  for  Earth 
radiation  budget  studies.  The  July  1979  HIRS2/MSU  Level 
3  data,  degraded  using  area-weighted  averaging  to  the 
THIR/TOMS spatial grid, is shown in Figure 1B. We then
resampled  the  degraded  HIRS2/MSU  data  back  to  the  2  by 
2.5  degree  grid,  and  subtracted  it  from  the  original 
HIRS2/MSU  data  (Figure  1C).  Note  that  the  differences  are 
nearly  as  large  as  the  range  of  the  signal,  with  both  positive 
and  negative  values.  The  pattern  of  differences  varies  with 
the  location  of  edges  in  the  original  data,  and  is  modulated 
by  the  relative  position  of  grid  boundaries.  Differences  are 
especially  large  at  high  latitudes,  where  the  spatial 
resolution  of  the  THIR/TOMS  grid  is  much  lower  than  that 
of the HIRS2/MSU grid, and wherever there are sharp edges
generated  by  cloud  patterns,  such  as  in  the  intertropical 
convergence  zone  and  monsoon  areas. 
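
The degrade, resample and subtract experiment can be imitated on a toy grid. The sketch below makes strong simplifying assumptions that do not hold for the actual data: the coarse grid is an integer multiple of the fine grid, and the averaging is unweighted rather than area-weighted. The function names are hypothetical.

```python
import numpy as np

def degrade(field, f):
    """Average a 2-D field over non-overlapping f x f blocks (unweighted)."""
    ny, nx = field.shape
    return field.reshape(ny // f, f, nx // f, f).mean(axis=(1, 3))

def resample_back(coarse, f):
    """Replicate each coarse cell onto the original fine grid."""
    return np.repeat(np.repeat(coarse, f, axis=0), f, axis=1)

def round_trip_difference(field, f):
    """Original field minus its degraded-and-resampled version."""
    return field - resample_back(degrade(field, f), f)

# A sharp edge that does not coincide with a coarse grid boundary.
field = np.zeros((4, 4))
field[:, 1:] = 1.0
print(round_trip_difference(field, 2))
```

In this toy case the differences are zero away from the edge and take both signs next to it. Moving the edge onto a coarse grid boundary makes them vanish, which mirrors the dependence on edge location and grid position noted above.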

With  the  Level  2  products,  we  have  access  to  physical 
quantities  at  the  full  resolution  acquired  by  the  instruments, 
and  avoid  introducing  additional  artifacts  into  the  comparison 
between  data  sets.  Level  2  data  are  not  uniformly  distributed 
over  the  surface.  At  low  latitudes  there  are  gores  in  the 
HIRS2  sampling  between  orbits,  whereas  at  high  latitudes, 
the  surface  is  heavily  oversampled.  Data  dropouts  and 
calibration  lines  occur  at  all  latitudes.  The  sample 
resolution  changes  by  more  than  a  factor  of  2  from  nadir  to 
the  limits  of  each  scan.  As  a  first  step  toward  making 
comparisons  among  Level  2  data  sets,  surfaces  that  take 
account of non-uniform clustering of data points may be fit
to  the  data.  We  have  begun  experimenting  with  locally 
adaptive  surface  fitting  techniques  (e.g.,  Renka,  1988),  and 
are  exploring  the  use  of  methods  that  generate  variance 
surfaces  together  with  each  fitted  surface  (Cresse,  1989,  and 
references  therein). 

Binning,  which  is  traditionally  used  to  make  comparisons 
among  global  data  sets,  is  performed  as  an  automatic 
procedure.  In  using  Level  2  data  for  validating  data  sets, 
geographic  sub-regions  of  the  globe  must  be  selected  for 
surface  fitting,  based  upon  some  criterion  that  evaluates  the 
density  of  points  relative  to  the  size  of  local  gradients  of  the 
parameter field, possibly in several directions. Figure 2
illustrates the role of interactive geographic subset selection
as part of the software we are assembling to perform the
HIRS2/MSU validation. 'HDF' in this figure refers to
Hierarchical  Data  Format,  a  transportable  file  format  that 
eliminates  all  but  an  initial  file  conversion  for  exchanging 
data  among  DEC,  Sun,  Macintosh,  and  other  machines  used 
in  the  validation  (NCSA  Software  Tools  Group,  1990). 
This allows us to store single copies of data files on
centrally located disks that are accessible across the network
to machines with differing architectures. We are currently
investigating the criteria for accepting subsets, choice of
method for surface fitting, and methods for making formal
comparisons among surfaces fitted to data from different
sources. The important question of interpolation in the
temporal domain we set aside for the present.

To summarize: in spite of the much larger volume of the
Level 2 data relative to Level 3, and the collection of issues
related to the spatial and temporal sampling of Level 2 data,
we need the ability to access, store, and process Level 2 data
for (1) studies of the internal consistency and precision of
the data set and (2) comparisons with other cloud
climatologies that are involved in the validation of the
HIRS2/MSU cloud parameters. We anticipate that similar
needs  will  arise  for  interdisciplinary  process  studies,  and  in 
work  directed  toward  using  observations  to  better  understand 
mesoscale  climatological  phenomena. 

3.  Tracking  Assumptions  in  the  Code 

Another issue that bears upon the degree to which we may
perform validation and other scientific analysis on large
data sets is our ability to grasp the collection of constraints
imposed  on  parameter  values  by  the  code  that  generates 
them.  An  assumption  embedded  in  a  large  data  handling 
code  may  produce  results  that  hide  important  information  in 
the  data,  or  may  produce  patterns  in  the  data  that  could  be 
incorrectly  interpreted  as  scientifically  meaningful. 

We  are  experimenting  with  methods  of  charting  the 
collection  of  assumptions,  as  a  way  of  calling  the  attention 
of  the  user  to  areas  where  the  code  may  influence  the  output 
parameters.  We  are  using  standard  charting  symbols  as 
much  as  possible  (e.g.,  Yourdon  and  Constantine,  1979). 
An  example  of  this  type  of  chart  is  Figure  3.  This  shows 
the  flow  of  control  and  the  flow  of  assumptions  made  in  a 
relatively  small  part  of  the  HIRS2/MSU  analysis  code  that 
produces  Level  3  data  from  Level  2  products.  This  chart 
made  clear  the  number  and  complexity  of  the  assumptions 
involved  in  generating  Level  3  products,  and  it  played  a  role 
in  our  assessment  of  the  value  of  Level  3  data  for  the 
validation  exercise. 

Charting  the  flow  of  control  provides  a  needed  context  for 
the  constraints  placed  on  the  data.  These  charts  take  a  step 
in  the  direction  of  making  it  possible  to  keep  track  of 
assumptions,  but  they  do  not  eliminate  the  work  involved  in 
carefully  assessing  the  meaning  of  derived  parameters. 

4.  Conclusions 

The  HIRS2/MSU  cloud  parameter  validation  effort  raises  a 
number  of  data  handling  issues  that  are  likely  to  arise 
frequently when scientific analysis is attempted on large
geophysical data sets. We need Level 2 data (individual
measurements  in  geophysical  units)  (A)  to  perform 
comparisons  among  data  sets  with  different  sampling,  and 
(B)  to  understand  the  effects  of  spatial  and  temporal 
sampling on the 'average' values obtained from a single data
set. The need for Level 2 data severely complicates data
handling.  Among  the  areas  where  advances  would  be  most 
helpful  are: 

1.  Surface  fitting  software  for  data  distributed  non-uniformly 
in  2-dimensional  space,  and  ways  to  obtain  some  measure  of 
the  associated  variances. 

2.  Software  for  making  formal  comparisons  among  fitted 
surfaces  from  several  sources,  and  their  associated  variance 
surfaces. 

3.  Ways  of  documenting  software  and  data  files  so  they  may 
be  exchanged  and  used  by  others  easily. 

4.  Ways  of  documenting  the  assumptions  embedded  in 
retrieval  and  processing  algorithms,  so  a  researcher  studying 
the  data  products  can  grasp  the  collection  of  constraints 
placed  on  the  output  data  by  the  code. 

5.  Additional  ways  of  storing  data.  For  a  given  Level  2 
data product, we need readily accessible data storage capacity
of  between  one  and  two  orders  of  magnitude  the  size  of  the 
basic  data  set,  for  intermediate  and  derived  products  that  are 
created  as  part  of  the  validation. 

Several  longer-term  needs  include: 

6.  The  development  of  validation  procedures  that  are  easy 
enough  to  apply  so  that  it  will  be  feasible  to  generate  and 
access a large number of validated geophysical data sets for
interdisciplinary  studies  of  all  types. 

7.  Ways  of  fitting  surfaces  to  data  values  distributed  non- 
uniformly  in  2-dimensional  space  and  in  time,  and  obtaining 
a  measure  of  the  associated  variances. 

8. Better ways of discovering patterns and surprises in
high-dimensional data sets.

9.  Ways  of  fitting  hyper-surfaces  to  higher  dimensional  data 
sets,  and  techniques  for  studying  them. 

We  have  described  our  data,  the  collection  of  problems  we 
are  facing  in  the  validation  work,  and  our  approaches  to 
some of these issues. Solutions or partial solutions to some
of the problems may exist that are not widely known
outside specialized data handling and computational statistics
communities. We hope to stimulate experts in these fields
to  participate  in  the  effort  to  improve  our  understanding  of 
Earth  through  the  study  of  large,  geophysical  data  sets. 

Figure 1. The Effect of Rebinning on Global Cloud Amount

Figure 2. Level 2 Data Analysis Software

Figure 3. HIRS2 Level 2 to 3 Software Overview / Assumptions

Figure 3. HIRS2 Level 2 to 3 Software Overview (Continued)

Abstract

Inference for a canonical parameter in the presence
of nuisance parameters usually requires high
dimensional integrals to obtain the marginal or
conditional tail probabilities. A simple and very
accurate method is proposed to obtain any arbitrary
level of significance for the parameter of interest.
This method only requires a fine tabulation
of the canonical parameter and the corresponding
observed likelihood function, which can be either
the full, marginal or conditional observed likelihood
function, as input, and produces the left tail
probabilities at the observed data value as output.
Applications of this method to some widely used
engineering statistical models will be discussed.

1.  Introduction 

A very accurate approximation to the density
of the mean of a sample of independent and identically
distributed observations was introduced to
statistics by Daniels (1954). This approximation
is generally referred to as the saddlepoint approximation.
It focuses on an approximate conversion
of a cumulant generating function to a corresponding
density function. However, it was not until the
appearance of the discussion paper by Barndorff-Nielsen
& Cox (1979) that the importance and usefulness
of this method became well known. Since
then, many statistical applications of the saddlepoint
approximation have been developed.

In many applications, it will be of interest to
compute approximate tail probabilities or cumulative
distribution functions, rather than densities. A
very accurate tail probability approximation for the
sample mean derived by the saddlepoint method
was obtained by Lugannani & Rice (1980) and further
discussed in Daniels (1987). Let $(x_1, \ldots, x_n)$
be a sample of observations, each with cumulant
generating function $c(\varphi)$. Then the Lugannani &
Rice formula, which approximates the distribution
function for the sample mean, $\bar{x}$, takes the form

$$F(\bar{x}) = \Phi(z) + \phi(z)\left\{\frac{1}{z} - \frac{1}{\zeta}\right\} \qquad (1)$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the density and distribution
functions of a standard normal distribution,

$$z = \mathrm{sgn}(\zeta)\{2n[\hat\varphi\bar{x} - c(\hat\varphi)]\}^{1/2} \qquad (2)$$

$$\zeta = \hat\varphi\{n\,c''(\hat\varphi)\}^{1/2} \qquad (3)$$

with $\hat\varphi$ satisfying $c'(\hat\varphi) = \bar{x}$, and $c'(\varphi)$ and $c''(\varphi)$
denoting the first and second derivatives of $c(\varphi)$. If
the sample mean is equal to the true mean, the
Lugannani & Rice formula is undefined; however,
Daniels (1987) provides a formula to handle this
situation. Since our main concern is the tail
probabilities, we will not discuss Daniels' formulation
in this paper.

A  detailed  review  of  the  saddlepoint  methods  in 
statistics  is  given  by  Reid  (1988). 

In Section 2, a numerical program that uses the
observed likelihood function as input and outputs
the significance function for a real parameter of
interest is developed. Some reliability models are
used to illustrate the accuracy of the procedure.
Section 3 examines how the preceding numerical
procedure can be applied to models with a scalar
parameter of interest and the presence of nuisance
parameters. Some concluding remarks are recorded
in Section 4.


2.  Converting  Observed  Likelihood  Function 
to  Significance  Function 

Let us denote the observed log likelihood function
of the model $f(x;\theta)$ at an observed data point,
$x^{obs}$, up to an additive constant, by

$$l(\theta) = \log f(x^{obs};\theta).$$

Also, denote the significance function as

$$p(\theta) = P(X < x^{obs};\theta),$$

which is the probability to the left of the data point,
$x^{obs}$. A $(1 - \alpha) \times 100\%$ confidence interval for $\theta$
can be obtained from the significance function by
$(p^{-1}(1 - \alpha/2),\; p^{-1}(\alpha/2))$.
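The inversion of a tabulated significance function into a confidence interval can be sketched as follows. This is only an illustration under stated assumptions, not the author's program: the function name `confidence_interval`, the use of linear interpolation between grid points, and the normal location check case are all introduced here for the example.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def confidence_interval(theta, p, alpha):
    """Return (p^{-1}(1 - alpha/2), p^{-1}(alpha/2)) by linear interpolation.

    theta : increasing grid of parameter values
    p     : significance function tabulated on that grid (assumed decreasing)
    """
    def inverse(level):
        for i in range(len(theta) - 1):
            high, low = p[i], p[i + 1]
            if low <= level <= high:
                if high == low:
                    return theta[i]
                w = (high - level) / (high - low)
                return theta[i] + w * (theta[i + 1] - theta[i])
        raise ValueError("level is outside the tabulated range")
    return inverse(1.0 - alpha / 2.0), inverse(alpha / 2.0)

# Hypothetical check case: X ~ N(theta, 1), so p(theta) = Phi(x_obs - theta)
x_obs = 0.3
grid = [-5.0 + 0.01 * i for i in range(1001)]
sig = [norm_cdf(x_obs - th) for th in grid]
lower, upper = confidence_interval(grid, sig, 0.05)
```

For the normal check case the interval should agree closely with the familiar x_obs ± 1.96, which gives a quick sanity test of the interpolation.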

The aim of this section is to illustrate how to
convert an observed likelihood function, or equivalently
an observed log likelihood function, to a
significance function.

2.1.  Exponential  model 

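For an exponential model

$$f(x;\theta) = \exp\{t\theta - c(\theta) + h(x)\}$$

with canonical parameter $\theta$ and minimal sufficient
statistic $t = t(x)$, the observed log likelihood function
at the data value $x^{obs}$ is

$$l(\theta) = t^{obs}\theta - c(\theta)$$

where $t^{obs} = t(x^{obs})$. Then the Lugannani & Rice
formula gives the significance function

$$p(\theta) = P(X < x^{obs};\theta) = P(T < t^{obs};\theta) = P(\hat\theta < \hat\theta^{obs};\theta)
= \Phi(r) + \phi(r)\left\{\frac{1}{r} - \frac{1}{q}\right\} + O(n^{-3/2}) \qquad (4)$$

where

$$r = \mathrm{sgn}(q)\{2[l(\hat\theta^{obs}) - l(\theta)]\}^{1/2} \qquad (5)$$

$$q = (\hat\theta^{obs} - \theta)\{j(\hat\theta^{obs})\}^{1/2} \qquad (6)$$

$$j(\hat\theta^{obs}) = -\left.\frac{d^2 l(\theta)}{d\theta^2}\right|_{\hat\theta^{obs}}$$

The accuracy of (4) in the nuisance parameter
case is discussed in Barndorff-Nielsen & Cox
(1989), Daniels (1987), and Fraser & Reid (1990).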


Fraser, Reid & Wong (1991) showed that with
a numerical tabulation of $l(\theta)$ over an equally and
finely spaced grid of $\theta$ in steps of $\pm\delta$, and successive
divided differences

$$l_1(\theta) = \{l(\theta + \delta) - l(\theta)\}/\delta \qquad (7)$$

$$l_2(\theta) = \{l_1(\theta) - l_1(\theta - \delta)\}/\delta, \qquad (8)$$

(6) can be approximated by

$$q \approx (\hat\theta^{obs} - \theta)\{-l_2(\hat\theta^{obs})\}^{1/2} \qquad (9)$$

Thus the significance function can be obtained from
the observed likelihood function by using (4) with
(5) and (9).


Example 1: The gamma distribution has wide application
in environmetrics and reliability. We construct
a simple example where the scale parameter
is 1 and the shape parameter is the parameter of
interest. For a sample of size 1, the density is

$$f(x;\theta) = \Gamma^{-1}(\theta)\,e^{-x}x^{\theta-1}$$

on $(0, \infty)$. Consider the data value $x^{obs} = 10$. The
observed log likelihood function is

$$l(\theta) = \theta\log(10) - \log(\Gamma(\theta)).$$

By using (4) with (5) and (9), the significance function
is obtained.
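The conversion used in Example 1 can be sketched in a few lines. This is a minimal illustration, not the generic program mentioned in the paper: the function names, the grid limits, the `min_r` cut-off (entries too close to the maximum are skipped because (4) becomes numerically unstable there), and the use of the grid maximiser in place of the exact maximum likelihood estimate are assumptions made for this sketch.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def significance(theta, loglik, min_r=0.1):
    """Convert a tabulated log likelihood into p(theta) via (4), (5) and (9).

    theta  : equally spaced grid of the canonical parameter
    loglik : observed log likelihood on that grid
    min_r  : assumed cut-off; entries with |r| < min_r are returned as None
    """
    delta = theta[1] - theta[0]
    # Grid maximiser, used here in place of the maximum likelihood estimate.
    k = max(range(1, len(theta) - 1), key=lambda i: loglik[i])
    # (7)-(8) combined: second divided difference at the maximiser.
    l2 = (loglik[k + 1] - 2.0 * loglik[k] + loglik[k - 1]) / delta ** 2
    out = []
    for th, l in zip(theta, loglik):
        r_sq = 2.0 * (loglik[k] - l)
        if r_sq < min_r ** 2:
            out.append(None)
            continue
        q = (theta[k] - th) * math.sqrt(-l2)                          # (9)
        r = math.copysign(math.sqrt(r_sq), q)                         # (5)
        out.append(norm_cdf(r) + norm_pdf(r) * (1.0 / r - 1.0 / q))   # (4)
    return out

# Example 1: gamma model with unit scale, shape theta, x_obs = 10
grid = [1.0 + 0.01 * i for i in range(1901)]        # theta from 1 to 20
ell = [th * math.log(10.0) - math.lgamma(th) for th in grid]
p = significance(grid, ell)

def exact_integer_shape(shape, x):
    """Exact P(X < x) for a gamma variable with integer shape, unit scale."""
    return 1.0 - math.exp(-x) * sum(x ** j / math.factorial(j)
                                    for j in range(shape))
```

At integer values of the shape parameter the exact left tail probability has a closed form, so `p[400]` (theta = 5) and `p[1400]` (theta = 15) can be compared directly with `exact_integer_shape(5, 10.0)` and `exact_integer_shape(15, 10.0)`.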
Moreover, we can also obtain the significance
function from some standard approximations,
namely the maximum likelihood estimate,
the score statistic, and the signed square root of the
likelihood ratio statistic,

$$\mathrm{sgn}(\hat\theta^{obs} - \theta)\{2[l(\hat\theta^{obs}) - l(\theta)]\}^{1/2} \sim N(0,1),$$

each referred to the standard normal distribution $N(0,1)$.

Figure 1 plots the significance functions obtained
by the 4 approximations and the exact significance
function obtained by exact integration. It is
not surprising that the proposed method is more accurate
than the 3 standard approximations because
it is a third order asymptotic method, whereas the
others are only first order methods. Furthermore,
the first order methods depend heavily on the normality
assumption, which clearly does not hold here
because of the fixed left boundary.

2.2.  Location  model 

Consider the simple location model

$$f(x;\theta) = f(x - \theta).$$

The observed log likelihood function at $x^{obs}$ is

$$l(\theta) = \log(f(x^{obs};\theta)) = l(x^{obs} - \theta).$$

Fraser (1988) and DiCiccio, Field & Fraser (1990)
showed that for this model, the significance function,
$p(\theta) = P(X < x^{obs};\theta)$, can be obtained by (4)
with (5), and with (6) replaced by

$$q = l'(\theta)\{j(\hat\theta^{obs})\}^{-1/2}.$$

Moreover, by applying (7) and (8), we have

$$q \approx l_1(\theta)\{-l_2(\hat\theta^{obs})\}^{-1/2} \qquad (10)$$

Thus, $p(\theta)$ can be obtained by (4) with (5) and (10).

Example 2: Consider the location gamma model
with known shape parameter. For this example,
we choose the shape parameter to be 3. With
sample size 1, the model has density

$$f(x;\theta) = \Gamma^{-1}(3)\,(x - \theta)^2\exp\{-(x - \theta)\}$$

with $x > \theta$. With the observed data $x^{obs} = 1$, the
observed log likelihood function is

$$l(\theta) = 2\log(1 - \theta) + \theta.$$

By using (4) with (5) and (10), the significance
function is obtained and compared with the first
order methods in Figure 2. Again,
the proposed method outperformed the standard
approximations.
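For the location case, the only change relative to the exponential model is that q is built from the first divided difference of the log likelihood, as in (10). The sketch below applies this to Example 2; the function names, grid limits and `min_r` cut-off are assumptions made for illustration, and the exact tail probability is included only as a check.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def significance_location(theta, loglik, min_r=0.1):
    """Tabulated log likelihood -> p(theta) using (4), (5) and (10)."""
    delta = theta[1] - theta[0]
    k = max(range(1, len(theta) - 1), key=lambda i: loglik[i])
    l2 = (loglik[k + 1] - 2.0 * loglik[k] + loglik[k - 1]) / delta ** 2
    out = []
    for i in range(len(theta) - 1):
        l1 = (loglik[i + 1] - loglik[i]) / delta                      # (7)
        q = l1 / math.sqrt(-l2)                                       # (10)
        r_sq = 2.0 * (loglik[k] - loglik[i])
        if r_sq < min_r ** 2 or q == 0.0:
            out.append(None)      # too close to the maximum to be stable
            continue
        r = math.copysign(math.sqrt(r_sq), q)                         # (5)
        out.append(norm_cdf(r) + norm_pdf(r) * (1.0 / r - 1.0 / q))   # (4)
    out.append(None)              # no forward difference at the last point
    return out

# Example 2: location gamma model, shape 3, x_obs = 1 (so theta < 1)
grid = [-9.0 + 0.01 * i for i in range(1000)]
ell = [2.0 * math.log(1.0 - th) + th for th in grid]
p = significance_location(grid, ell)

def exact(theta):
    """P(X < 1; theta) = P(G < 1 - theta) for G gamma with shape 3."""
    u = 1.0 - theta
    return 1.0 - math.exp(-u) * (1.0 + u + 0.5 * u * u)
```

With a sample of size 1 the agreement is naturally coarser than in large samples; in this sketch the values at theta = -4 (`p[500]`) and theta = 0.5 (`p[950]`) can be compared with `exact(-4.0)` and `exact(0.5)`.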

3.  Conditional  and  Marginal  Inferences 

In  the  preceding  section,  we  have  discussed  the 
conversion  of  an  observed  likelihood  function  to  a 
significance  function  for  scalar  parameter  models. 
Now,  let  us  examine  some  multiparameter  models. 

Let $\theta = (\psi, \lambda)$ with a scalar parameter of interest
$\psi$ and nuisance parameter $\lambda$. Our aim is to
approximate either the conditional or the marginal
observed likelihood function for $\psi$ such that the
significance function, $p(\psi)$, can be obtained by the
numerical procedure described in the previous section.

3.1.  Exponential  model 

Consider an exponential model with canonical
parameter $\theta = (\psi, \lambda)$ where $\psi$, a scalar parameter,
is our parameter of interest. The density has the
form

$$f(x;\theta) = \exp\{\psi t_1 + \lambda' t_2 - c(\psi, \lambda) + h(x)\}$$

where $(t_1, t_2) = (t_1(x), t_2(x))$ is the minimal sufficient
statistic.


The conditional distribution of $t_1$ given $t_2$ is free
of the nuisance parameter and would typically be
used for inference about $\psi$ in the absence of knowledge
of $\lambda$. An approximation to the observed
likelihood from this conditional distribution
is given in Cox & Reid (1987) and Fraser & Reid
(1990); the approximated observed conditional
log likelihood function takes the form

$$l(\psi) \approx l(\psi, \hat\lambda_\psi) + \frac{1}{2}\log|j_{\lambda\lambda}(\psi, \hat\lambda_\psi)| \qquad (11)$$

where $\hat\lambda_\psi$ is the maximum likelihood estimate of $\lambda$
for a fixed $\psi$, and $j_{\lambda\lambda}(\psi, \hat\lambda_\psi)$ is the observed
information concerning $\lambda$ for a fixed $\psi$. By tabulating $\psi$
and $l(\psi)$, the significance function, $p(\psi) = P(T_1 < t_1^{obs}; \psi)$,
can be obtained by (4) with (5) and
(9). Fraser & Reid (1990) showed that this conversion
is $O(n^{-3/2})$ based on the conditional distribution
of $t_1$ given $t_2$.


Example 3: Consider the Proschan (1963) data,
which recorded the times between successive failures
of air conditioning equipment in 13 Boeing 720
aircraft. For aircraft number 7909, the data are
recorded in Keating, Glaser & Ketchum (1990).
The model being considered is the two parameter
gamma model with density

$$\Gamma^{-1}(n\psi)\,\mu^{-n\psi}\exp\{-t_2/\mu + n\psi\log(t_2)\} \times
\Gamma(n\psi)\,\Gamma^{-n}(\psi)\exp\{-n\psi\log(t_2) + \psi t_1\},$$

where $n = 29$, $t_1^{obs} = 118.8084$, and $t_2^{obs} = 2422$.
From (11), the conditional observed likelihood function
is obtained. Keating, Glaser & Ketchum
(1990) produce various tables to obtain the observed
level of significance. In particular, they
tested whether the gamma distribution has an increasing
failure rate ($H_0: \psi = 1$ versus $H_1: \psi > 1$)
and they reported that the observed level of significance
associated with the test is 3.84%.


Wong (1991) applied the proposed procedure
and obtained the observed level of significance as
3.85%. The advantage of the proposed procedure
is its efficiency and simplicity, and the ability to
obtain an arbitrary level of significance from the
significance function.
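The calculation in Example 3 can be imitated with a short script. This sketch makes several assumptions that should be read as such: the sample size is taken to be n = 29, the conditional log likelihood is taken directly from the second factor of the density (the exact conditional form rather than approximation (11)), and the grid limits and step are arbitrary choices. It is not the program used by Wong (1991).

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

n = 29              # assumption: number of intervals for aircraft 7909
t1_obs = 118.8084   # observed t1 quoted in the text (sum of log intervals)
t2_obs = 2422.0     # observed t2 quoted in the text (sum of intervals)

def cond_loglik(psi):
    """Log conditional density of t1 given t2, up to an additive constant."""
    return (math.lgamma(n * psi) - n * math.lgamma(psi)
            - n * psi * math.log(t2_obs) + psi * t1_obs)

delta = 0.001
grid = [0.5 + delta * i for i in range(3001)]       # psi from 0.5 to 3.5
ell = [cond_loglik(ps) for ps in grid]

# Grid maximiser and second divided difference, as in (7)-(9)
k = max(range(1, len(grid) - 1), key=lambda i: ell[i])
l2 = (ell[k + 1] - 2.0 * ell[k] + ell[k - 1]) / delta ** 2

def significance_at(i):
    """p(psi) at grid point i from (4) with (5) and (9)."""
    q = (grid[k] - grid[i]) * math.sqrt(-l2)
    r = math.copysign(math.sqrt(2.0 * (ell[k] - ell[i])), q)
    return norm_cdf(r) + norm_pdf(r) * (1.0 / r - 1.0 / q)

i0 = 500                                 # grid[500] corresponds to psi = 1
level = 1.0 - significance_at(i0)        # right tail: test against psi > 1
```

Because the alternative is $\psi > 1$, the observed level of significance is the right tail, one minus the significance function at $\psi = 1$; under the assumptions above the sketch gives a value close to the 3.85% quoted in the text.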

3.2.  Transformation  model 

Consider a general location model

$$f(x;\theta) = f(t_1 - \psi,\; t_2 - \lambda).$$

The marginal density of $t_1$ is free of $\lambda$ and would
typically be used for inference concerning $\psi$ in the
absence of knowledge of $\lambda$. Fraser & Reid (1990)
showed that for this model,

$$l(\psi) \approx l(\psi, \hat\lambda_\psi) - \frac{1}{2}\log|j_{\lambda\lambda}(\psi, \hat\lambda_\psi)| \qquad (12)$$

is an approximation of the observed
marginal log likelihood function based on the
marginal density of $t_1$. Again by tabulating $\psi$ and
$l(\psi)$, the significance function, $p(\psi) = P(T_1 < t_1^{obs}; \psi)$,
can be obtained by using (4) with (5) and
(10). This conversion from observed likelihood to
significance function is also shown in Fraser & Reid
(1990) to be an $O(n^{-3/2})$ approximation.


We can now consider the location-scale model,

$$f(x;\theta) = \frac{1}{\sigma}f\left(\frac{x - \mu}{\sigma}\right)$$

where $\mu$ is the location parameter and $\sigma$ is the scale
parameter. The sampling density can be written in
location form, with $s^2$ the sample variance and with
$\mu$ and $\gamma = \log(\sigma)$ as location parameters. Hence the
joint observed log likelihood function can be written as

$$l(\mu, \gamma) = -n\gamma + \sum_{i=1}^{n}\log(f((x_i^{obs} - \mu)e^{-\gamma})).$$

By (12), we can tabulate the corresponding approximated
marginal likelihood function and thus the
significance function can be obtained.

Example 4: Let $(x_1, \ldots, x_n)$ be $n$ observations sampled
from a Weibull population with density

$$f(x) = \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta-1}\exp\left\{-\left(\frac{x}{\alpha}\right)^{\beta}\right\}.$$

Let $\mu = \log(\alpha)$, $\sigma = 1/\beta$ and $y_i = \log(x_i)$. Then
$y_i = \mu + \sigma z_i$ where $z_i$ has the extreme value distribution
with density

$$f(z) = \exp\{z - e^{z}\}.$$

In other words, the Weibull distribution can be obtained
from a location-scale transformation of the
extreme value distribution. Thus the joint observed
log likelihood function is

$$l(\mu, \gamma) = -n\gamma + \sum_{i=1}^{n}\left[(y_i^{obs} - \mu)e^{-\gamma} - \exp\{(y_i^{obs} - \mu)e^{-\gamma}\}\right].$$

From (12), we can tabulate the marginal observed
likelihood function for $\mu$ and $\gamma$ separately.

The above model is applied to the Lieblein &
Zelen (1956) data, recorded in Fraser (1979, page
33). Tables 1 and 2 compare the 90%, 95% and
99% confidence intervals for $\mu$ and $\sigma$ obtained by
exact integration, as recorded in Fraser (1979),
and by the proposed method. Again, the comparison
shows that the numerical procedure is very accurate.

4.  Conclusion 

In this paper, we required the parameter of interest
to be a scalar canonical parameter. However,
if this is not the case, Fraser & Reid (1990) derived
a method to extract the canonical parameter
from the observed likelihood function. Moreover,
we can extend the method described in Section 3
to the non-normal regression model. Finally,
a generic computer program has been developed for
this numerical procedure. It requires a finely and
equally spaced tabulation of the canonical parameter
and its observed likelihood function as input,
and produces the corresponding significance function
as output. The program is available from the
author upon request.
Table 1: Confidence intervals for $\mu$ (in units of y)

         Exact             Approximation
  90%    (4.221, 4.590)    (4.2205, 4.5891)
  95%    (4.182, 4.627)    (4.1811, 4.6264)
  99%    (4.099, 4.701)    (4.0989, 4.7041)

Table 2: Confidence intervals for $\sigma$ (in units of y)

         Exact             Approximation
  90%    (0.386, 0.659)    (0.3853, 0.6589)
  95%    (0.369, 0.700)    (0.3689, 0.6999)
  99%    (0.340, 0.792)    (0.3398, 0.7918)

Abstract

As greater computing power becomes routinely available to researchers, analyses based on Bayesian
or likelihood methods become easier to perform, especially since the increase in computing power
has been accompanied by development of inventive statistical algorithms for inference. We consider
here the nonlinear regression model but these approaches to inference are applicable in more general
circumstances and we feel the comparisons will remain useful. Several methods can be used for
inference in nonlinear regression: propagation of errors, likelihood profiles, approximate marginal
likelihoods and posteriors, and Monte Carlo methods such as importance sampling and the Gibbs
sampler. These methods vary in computing intensity and in their ability to handle poorly conditioned
situations. Furthermore, since some of these methods have only been recently developed, it is not
easy for the practitioner to compare them and choose between them because they are not widely
implemented. We demonstrate the respective merits of these methods in a small but instructive
example.

Keywords: Nonlinear Models; Profile Likelihood; Importance Sampling; Gibbs Sampler; Approximate
Marginalization
1. Our Motivating Example

Electron Spectroscopy for Chemical Analysis (ESCA) is a
key technique at the Engineering Research Center for Plasma
Aided Manufacturing, University of Wisconsin, to study the
chemical bonding structure of polymer surfaces. In our case,
the same material, a deposited polymer, will be examined
several times over a period of weeks, and the experimenters
want to know how the bond structure changes. A plot of the
data from a spectroscopic analysis of one sample, along with
fitted components and residuals, is shown in Figure 1.

The  immediate  objective  of  the  analyst  is  to  resolve  these 
data  into  a  known  number  of  peaks,  each  of  the  form 


Here the parameter $\theta_j$ is the center (location) of peak $j$, $\gamma_j$ is
the bandwidth at half the peak height, $p_j$ is the proportion of
peak $j$ in the form of a Gaussian curve (hence $1 - p_j$ is the
proportion of peak $j$ in the form of a Cauchy curve), and
$\alpha_j$ is the peak height.

Figure 1: Observed electron intensities (·) versus binding
energy for the Carbon 1S peak in plasma polymerized
methyl methacrylate (PPMMA). Also shown are
the fitted spectrum (solid line), its components (dashed
lines), the baseline (dot-dashed line), parameter estimates
using weighted nonlinear least squares, and the
residuals.

Fitting four such peaks to these data using weighted least
squares produces a fit as in Figure 1. Weights were used to
accommodate systematic differences in the variance of the
response. Moreover, for this particular fit a strong prior was
enforced on the spacings between the peak locations, because
these spacings are known fairly well for this polymer. Without
the prior the problem would be overparameterized and the
parameters would be unidentifiable.

Important characteristics of the example are that it employs
a nonlinear statistical model with rather precise data and a
reasonable understanding of the mechanism under study, and
that the model parameters carry physical meanings.

Specifically,  interest  centers  on  the  relative  heights  of  the 
peaks,  which  are  related  to  the  relative  concentrations  of 
the  corresponding  chemical  bonds.  This  requires  careful 
inference  about  some  of  the  parameters,  the  peak  heights, 
and  of  functions  of  them,  while  the  others  enter  as  nuisance 
parameters.  Such  inference  is  notoriously  difficult,  and 
although  we  have  been  working  on  the  ESCA  problem  for 
a  considerable  amount  of  time,  we  have  not  arrived  at  a 
satisfactory  solution  yet.  However,  the  new  methods  which 
we  shall  describe  in  our  paper  seem  powerful  enough  to 
handle  problems  of  this  degree  of  difficulty  in  the  near  future. 

In  Section  2  we  shall  introduce  the  new  methods  and  in 
Section  3  we  shall  demonstrate  how  they  perform  on  an 
example  which  is  much  simpler  than  the  ESCA  problem, 
but which displays some of the characteristic difficulties. In
Section  4  we  shall  summarize  our  results. 

2.  Exploring  The  Objective  Function 

For  making  inferences  about  the  parameters  in  a  nonlinear 
model,  we  measure  the  “quality”  of  a  parameter  vector  with 
an  objective  function  such  as  the  residual  sum  of  squares  or 
the  likelihood  or  the  posterior  density.  For  point  estimates, 
we  usually  quote  the  values  of  the  parameters  that  optimize 
the  objective  function.  To  measure  the  variability  of  the 
parameters  (or  the  variability  of  their  estimates)  jointly  or 
individually,  the  most  sensible  and  direct  ways  are  through 
the  objective  function.  Thus  we  want  to  plot  contours  or 
projections  of  contours  of  the  objective  function,  we  want 
to  integrate  the  objective  function  over  nuisance  parameters 
and,  in  general,  explore  how  the  objective  function  depends 
on  the  parameters.  We  may  want  to  do  this  for  the  original 
parameters  or  for  functions  of  these  parameters. 

Several different methods can be used for exploring the
objective function. The simplest method, based on a local
quadratic approximation to the objective function near the
optimum, is often called the “propagation of errors” method.
For the nonlinear regression model, a linear approximation to
the expectation function produces a quadratic approximation
to the sum of squares function (Bates and Watts, 1988, chapter
2), which is used to form approximate standard errors and
correlations. For likelihood and Bayesian analyses we usually
approximate the log-likelihood or log-posterior density at the
optimum.

Propagation of errors is very simple but often quite
inaccurate. For greater accuracy, two basic approaches to
exploring the objective function can be used. These are: 1)
re-optimizing the objective with one or more of the parameters
held fixed or 2) Monte Carlo methods designed to create
a sample from a density represented by the objective function.
Re-optimization is known as profiling. The Monte Carlo
methods include importance sampling (Rubinstein, 1981) and
the Gibbs sampler (Gelfand and Smith, 1990). In his discussion
of this paper, Luke Tierney described the use of another
Monte Carlo method, the Metropolis algorithm (Metropolis et
al., 1953). Hybrid methods, where information from profiling
is used to enhance the efficiency of the Monte Carlo methods,
are also possible.

In profiling we choose a parameter, say $\theta_1$, and while fixing
it at a value close to but different from the estimate, say $\hat\theta_1 - \delta$,
optimize the objective with respect to the remaining parameters.
If $S$ represents the objective, the profiled objective
can be written $\tilde{S}(\theta_1)$ with the conditionally optimal values of
the other parameters written $\tilde\theta_{-1}(\theta_1)$. This is repeated for
$\hat\theta_1 - 2\delta, \ldots$ and $\hat\theta_1 + \delta, \hat\theta_1 + 2\delta, \ldots$ until $\tilde{S}$ is sufficiently
different from $S(\hat\theta)$. It produces three pieces of information:
1) the profiled value of the objective, $\tilde{S}$, 2) the conditional
estimates of the other parameters, $\tilde\theta_{-1}(\theta_1)$, called the profile
traces, and 3) the conditional Hessian of the objective.
Piece 1) can be used by itself to define univariate empirical
parameter transformations as described below. Pieces 1) and
3) are used in Laplacian integration methods to approximate
marginal posterior densities (Tierney and Kadane, 1986; Tierney,
Kass, and Kadane, 1988) while pieces 1) and 2) can be
used to approximate projections of contours (Bates and Watts,
1988, Appendix 6).

To define the univariate empirical parameter transformations,
we note that if the objective were quadratic in the $\theta$,
then $\tilde{S}$ would be quadratic in $\theta_1$ and

$$\zeta(\theta_1) = \mathrm{sign}(\theta_1 - \hat\theta_1)\sqrt{\tilde{S}(\theta_1) - S(\hat\theta)} \qquad (2.1)$$

would be linear in $\theta_1$. For the nonlinear regression model,
dividing (2.1) by $s$, an estimate of the standard deviation of the
disturbance, produces

$$\tau(\theta_1) = \mathrm{sign}(\theta_1 - \hat\theta_1)\sqrt{\tilde{S}(\theta_1) - S(\hat\theta)}\,/\,s \qquad (2.2)$$

a nonlinear analogue of the t-statistic (Bates and Watts, 1988,
chapter 6). If the objective being optimized is the negative of
the log-likelihood, (2.1) defines a nonlinear analogue of a z
statistic. Whenever the objective is unimodal, $\zeta$ is monotone
over the range of interest and defines a univariate transformation.
If these transformations are used on each parameter, the
objective function is much closer to being quadratic. To
examine the objective function in the original parameters,
the transformation $\theta \rightarrow \tau$ has to be inverted. Usually
the $\tau$ values are sufficiently well behaved that the forward
transformation can be defined by an interpolating spline but
the backward transformation has to be defined with some
care.
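A small numerical sketch of profiling and of the profile t function (2.2) may help fix ideas. It uses the two-parameter model $y = \theta_1(1 - \exp(-\theta_2 t))$ and profiles on $\theta_2$, because $\theta_1$ then enters linearly and its conditional estimate (the profile trace) has a closed form. The data values are an assumption: they are the six observations distributed as the `BOD` data set in R, which is attributed there to Bates and Watts (1988, Table A1.4). The grid limits and function names are also illustrative choices.

```python
import math

# Assumed data: days and biochemical oxygen demand (R's BOD data set)
t = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
y = [8.3, 10.3, 19.0, 16.0, 15.6, 19.8]

def profile(theta2):
    """Profiled sum of squares and profile trace for a fixed theta2.

    For fixed theta2 the model is linear in theta1, so the conditional
    least squares estimate of theta1 is available in closed form.
    """
    g = [1.0 - math.exp(-theta2 * ti) for ti in t]
    theta1 = sum(yi * gi for yi, gi in zip(y, g)) / sum(gi * gi for gi in g)
    s_tilde = sum((yi - theta1 * gi) ** 2 for yi, gi in zip(y, g))
    return s_tilde, theta1

# Tabulate the profile on a grid of theta2 values (0.05 to 3.0)
grid = [0.05 + 0.001 * i for i in range(2951)]
prof = [profile(th) for th in grid]

# Grid minimiser stands in for the least squares estimate
k = min(range(len(grid)), key=lambda i: prof[i][0])
theta2_hat = grid[k]
s_hat, theta1_hat = prof[k]
s = math.sqrt(s_hat / (len(y) - 2))     # residual standard deviation

def tau(i):
    """Profile t function (2.2) evaluated at grid point i."""
    diff = max(prof[i][0] - s_hat, 0.0)
    return math.copysign(math.sqrt(diff), grid[i] - theta2_hat) / s
```

Plotting `tau(i)` against `grid[i]` would show how far the profile departs from a straight line, which is the visual check of nonlinearity that the transformation is meant to support; `prof[i][1]` gives the corresponding profile trace for $\theta_1$.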

One  deficiency  of  the  profiling  methods  is  that  they  give 
good  information  about  the  parameters  chosen  for  the  model 
but  not  about  functions  of  the  parameters.  Especially  for 
Bayesian  analyses,  Monte  Carlo  methods  that  generate  a 
sample  from  the  posterior  provide  a  simple  method  of  eval¬ 
uating  the  behavior  of  functions  of  the  parameters.  The 
primary  advantage  of  these  Monte  Carlo  methods  is  that  they 
change  a  problem  in  parametric  inference  into  a  problem  in 
data  analysis  and  we  have  good  tools  for  data  analysis  in 
several  dimensions. 

3.  The  BOO  Example 

The model $y_i = \theta_1 (1 - \exp(-\theta_2 t_i)) + \varepsilon_i$ is to be fitted to
the data, from Table A1.4 of Bates and Watts (1988, p. 270),
shown  in  Figure  2.  Note  that  the  variability  is  very  high  and 
that  the  observation  interval  is  too  short  to  capture  the  steady 
state  behavior  with  respect  to  time.  The  small  sample  size  and 
an  unfortunate  experimental  design  cause  pathologies  of  the 
likelihood  surface,  and,  since  there  are  only  two  parameters, 
these  pathologies  can  be  studied  conveniently.  In  practice, 
such  a  problem  should  be  approached  by  improving  the  exper¬ 
imental  design  and  by  taking  more  data;  not  by  overanalyzing 
the  existing  observations.  But  inference  procedures  should 
also  work  in  ill-conditioned  cases  or  at  least  point  to  the 
causes of the ill-conditioning, so we use the BOD example as
a test case.

3.1. Likelihood Contours

The likelihood for the BOD example is of the form

$$L(\theta_1, \theta_2, \sigma \mid t, y) = C \cdot \exp\left(-\frac{n}{2}\ln\sigma^2 - \frac{S(\theta_1, \theta_2)}{2\sigma^2}\right)$$

with

$$S(\theta_1, \theta_2) = \sum_{i=1}^{n} \left[y_i - \theta_1\left(1 - \exp(-\theta_2 t_i)\right)\right]^2,$$

hence contours of the sum of squares $S(\theta_1, \theta_2)$ are also
likelihood contours. Using the approximation

$$\phi(\theta) = \frac{[S(\theta) - S(\hat\theta)]/p}{S(\hat\theta)/(n-p)} \sim F_{p,\,n-p},$$

where $\hat\theta$ is the least squares estimate of $(\theta_1, \theta_2)$,
these contours can be labeled by their approximate frequency content.
Such contours, created by evaluating $\phi(\theta)$ on an equispaced grid
of 100 steps over $[-20, 50] \times [-2, 6]$, are shown in Figure 3.

Figure 2: Biochemical Oxygen Demand (BOD) data and
fitted curve

Figure 3: Sum of squares contours for the BOD data.
Levels are chosen to give nominal coverage of 80%,
90%, 95%, and 99.9% as confidence regions.
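The grid evaluation behind Figure 3 can be sketched as follows. This is a minimal illustration and not the authors' code. The six data values are an assumption: they are the BOD observations as distributed in R's `BOD` data set, which cites the same table of Bates and Watts (1988). The contour levels assume the usual F-based labeling $S(\hat\theta)\,[1 + \tfrac{p}{n-p} F_{p,n-p}]$, and the helper names (`sum_of_squares`, `f_quantile_2`, `contour_level`) are hypothetical.

```python
import math

# Assumed data: the six BOD observations as distributed in R's `BOD` data set.
t = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
y = [8.3, 10.3, 19.0, 16.0, 15.6, 19.8]
n, p = len(y), 2


def sum_of_squares(theta1, theta2):
    """S(theta1, theta2) for the model theta1 * (1 - exp(-theta2 * t))."""
    return sum((yi - theta1 * (1.0 - math.exp(-theta2 * ti))) ** 2
               for ti, yi in zip(t, y))


def f_quantile_2(coverage, df2):
    """Quantile of F(2, df2); closed form because the numerator df is 2."""
    return (df2 / 2.0) * ((1.0 - coverage) ** (-2.0 / df2) - 1.0)


def contour_level(s_min, coverage):
    """Sum-of-squares value whose contour has the given nominal coverage."""
    return s_min * (1.0 + p / (n - p) * f_quantile_2(coverage, n - p))


def grid(lo, hi, steps):
    """Equispaced grid with `steps` steps, i.e. steps + 1 points."""
    return [lo + (hi - lo) * k / steps for k in range(steps + 1)]


theta1_grid = grid(-20.0, 50.0, 100)
theta2_grid = grid(-2.0, 6.0, 100)
# surface[i][j] = S(theta1_grid[i], theta2_grid[j]); pass to any contour plotter.
surface = [[sum_of_squares(a, b) for b in theta2_grid] for a in theta1_grid]

s_hat = sum_of_squares(19.1426, 0.5310)  # estimates quoted in the text
levels = {c: contour_level(s_hat, c) for c in (0.80, 0.90, 0.95, 0.999)}
```

The cost is the point of the example: 101 x 101 evaluations for two parameters, growing exponentially with the number of parameters.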

The contours indicate ill-conditioning. Contours at high
levels are open in the $\theta_2$ direction and fold over as $\theta_2$
passes zero. Since $\theta_2$ is a rate constant, large values for
$\theta_2$ mean that the response will increase rapidly, reaching
the asymptote almost instantaneously. If $\theta_2$ is so large that
the curve is near the asymptote at the first data value, the
sum of squares is insensitive to further increases in $\theta_2$. As
$\theta_2 \to \infty$, the response changes instantaneously from zero to
the asymptotic level. The $\tau$ value for this case defines the
level above which likelihood contours will be open in the $\theta_2$
direction. Alternatively, if $\theta_2$ passes zero from above, the
model will become locally overparameterized. If the absolute
value of $\theta_2$ is small, the expression $\theta_1 \cdot [1 - \exp(-\theta_2 \cdot t)]$
reduces to $\theta_1 \cdot \theta_2 \cdot t$ and hence the $\theta_1$ that minimizes the
sum of squares function for fixed $\theta_2$ will be approximately
$\theta_1 = c/\theta_2$ and will jump from $+\infty$ to $-\infty$ as $\theta_2$ crosses zero.
In practice one would tend to restrict $\theta_2$ to be positive so the
latter effect would not occur. For our purpose of illustration
we will leave $\theta_2$ unrestricted and on its original scale.


3.2.  Inference  Based  on  First  Order  Approximations 

The least squares estimates are $\hat\theta_1 = 19.14$ and $\hat\theta_2 = 0.53$
with approximate covariance matrix

$$\Sigma = \begin{pmatrix} 6.2296 & -0.4323 \\ -0.4323 & 0.0412 \end{pmatrix} \tag{3.1}$$


Figure 4 shows 80%, 90%, 95%, and 99.9% contours based
on the approximation

$$\phi'(\theta) = \frac{[(\theta - \hat\theta)^{T}\,\Sigma^{-1}\,(\theta - \hat\theta)]/p}{S(\hat\theta)/(n-p)} \sim F_{p,\,n-p}.$$

Figure 4: Approximate 80%, 90%, 95%, and 99.9%
likelihood contours generated using the linear approximation
to the model function.

Clearly, these regions differ greatly from the likelihood
regions displayed in Figure 3. Which ones are right? The
answer is “both” and “neither”. Both sets of contours are
based on approximations. While the likelihood contours
correspond to parameter pairs which produce fits of equal quality,
measured by the sum of squares, the ones obtained from the
linear approximation are the correct asymptotic (large sample)
contours from a frequentist's point of view. Nevertheless, in
this small sample case, the likelihood contours seem more
appropriate to us. An additional reason for this is that the
validity of the F approximation for the likelihood contours
is only affected by intrinsic nonlinearity while the regions
from the linear approximation are affected by both intrinsic
nonlinearity and parameter-effects nonlinearity (Bates and
Watts, 1988, Chapter 7).

3.3.  Likelihood  Profiles 

The likelihood contours in Figure 3 were obtained by evaluating
$\phi(\theta)$ over a fine grid in $\theta_1$ and $\theta_2$. While this approach is still
reasonable for two parameters and a small region of interest,
the amount of computation necessary increases exponentially
with the number of parameters and quickly surpasses the
available computing power. Therefore it is desirable to have
methods which can be used to create good approximate
likelihood contours with a computing effort which is linear in the
number of parameters. Profiling the likelihood is one of these
methods. We shall now show how this method performs in
the case of the BOD example. Using the definition of $\tau_i$ from
(2.2), we computed a selection of $(\theta_1, \tau_1)$ and $(\theta_2, \tau_2)$ pairs
over the intervals $-3.5 < \tau_i < 3.5$ and obtained approximate
$\theta \to \tau$ transformations for both parameters by spline
interpolation. Then we constructed approximate likelihood contours
by generating ellipses based on the linear approximation in
the $\tau$ coordinates and transforming them back into $\theta$ space.
Figure 5 shows the back-transformed contours for the levels
80%, 90%, 95%, and 99.9%.

Figure 5: Approximate 80%, 90%, 95%, and 99.9%
likelihood contours generated using a linear approximation
in the $\tau$ parameters and back-transforming to $\theta$.

These contours are already quite similar to the ones
computed with the grid method. They can be enhanced further by
using the profile traces as described in Bates and Watts (1988,
Appendix 6).
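A sketch of how the $(\theta_2, \tau_2)$ pairs might be computed is given below. It is illustrative only and not the authors' implementation. It assumes the six BOD observations as distributed in R's `BOD` data set, and it exploits the fact that the model is linear in $\theta_1$, so the conditional minimum over $\theta_1$ for fixed $\theta_2$ is available in closed form. The helpers `conditional_fit`, `golden_min`, and `tau` are hypothetical names, and the golden-section search assumes the profile sum of squares is unimodal on the bracket used.

```python
import math

# Assumed data: the six BOD observations as distributed in R's `BOD` data set.
t = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
y = [8.3, 10.3, 19.0, 16.0, 15.6, 19.8]
n, p = len(y), 2


def conditional_fit(theta2):
    """Return (theta1, S) minimizing the sum of squares for fixed theta2."""
    g = [1.0 - math.exp(-theta2 * ti) for ti in t]
    sgg = sum(gi * gi for gi in g)
    sgy = sum(gi * yi for gi, yi in zip(g, y))
    theta1 = sgy / sgg  # linear least squares in theta1
    s = sum((yi - theta1 * gi) ** 2 for gi, yi in zip(g, y))
    return theta1, s


def golden_min(f, lo, hi, tol=1e-8):
    """Golden-section search for the minimizer of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)


theta2_hat = golden_min(lambda b: conditional_fit(b)[1], 0.05, 3.0)
theta1_hat, s_hat = conditional_fit(theta2_hat)
s = math.sqrt(s_hat / (n - p))  # estimate of the disturbance standard deviation


def tau(theta2):
    """Profile statistic (2.2) for theta2."""
    excess = max(conditional_fit(theta2)[1] - s_hat, 0.0)
    return math.copysign(math.sqrt(excess), theta2 - theta2_hat) / s


# Pairs that an interpolating spline could then turn into a transformation.
profile = [(b, tau(b)) for b in (0.2, 0.3, 0.4, 0.7, 1.0, 2.0, 5.0)]
```

Each pair costs one conditional fit, so the work grows linearly with the number of parameters rather than exponentially as for the grid.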

3.4.  A  Note  on  Likelihood  And  Bayesian  Methods 

While in likelihood methods one describes features of the
likelihood function (maximum, contours, etc.) and uses
frequentist arguments to attach probability statements to these
features, in the Bayesian approach one treats the posterior,
whose main part is the likelihood, as a probability distribution
and bases the entire inference on it. Using a likelihood
together with a flat prior provides a bridge between the
likelihood and the Bayesian approaches. However, a likelihood
does not always define a proper posterior because the integral
over the entire parameter space may be infinite. The BOD
problem is an example of this. Since the high level contours
are open for large values of $\theta_2$, the integral of the BOD
likelihood is infinite. The methods we shall describe now all
require a proper posterior, and therefore the BOD likelihood
needs to be modified. One way of doing this is by restricting
the BOD likelihood to a finite domain such as the range of the
previous plots. This amounts to an indicator prior on the
rectangle $[-20, 50] \times [-2, 6]$ and a flat prior on $\sigma^2$. Note
that this prior is chosen for the purpose of illustration only.
In real life, negative values for $\theta_2$ are impossible. Therefore
one should reparameterize the problem by, for example,
replacing $\theta_2$ with $\log \theta_2$ and possibly use a prior which is locally
uniform on the expectation surface in the new parameters
(Bates and Watts, 1988, Chapter 6).

3.5.  Importance  Sampling 

Importance sampling is one of the Monte Carlo techniques
for exploring posterior distributions. Since we are interested
in $\theta_1$ and $\theta_2$, we first have to marginalize the posterior with
respect to $\sigma^2$. The resulting posterior for $\theta_1$ and $\theta_2$ becomes

$$p(\theta_1, \theta_2) = C \cdot [S(\theta_1, \theta_2)]^{-(n/2 - 1)},$$

with

$$C = \left(\int_{[-20,50] \times [-2,6]} [S(\theta_1, \theta_2)]^{-(n/2 - 1)}\, d\theta_1\, d\theta_2\right)^{-1}.$$

In importance sampling we create a sample $\theta^{(1)}, \theta^{(2)}, \ldots$ from
an approximation $I(\theta)$ to the posterior and attach weights $w_i$
proportional to $p(\theta^{(i)})/I(\theta^{(i)})$ to it. Usually we normalize the
weights such that $\sum_i w_i = 1$. These weighted samples can
then be used as substitutes for samples from $p(\theta)$ in forming
histograms, integrals, etc.
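The weighting step can be sketched in a few lines. This is a toy illustration and not the BOD computation: the target and the importance density below are assumptions chosen so that the importance density is too narrow, which produces a few dominating weights. The effective-sample-size diagnostic $1/\sum_i w_i^2$ is a common summary that does not appear in the paper, and the function names are hypothetical.

```python
import math
import random


def normalized_weights(log_p, log_i):
    """Weights proportional to p/I, normalized to sum to one.

    Both arguments are log densities at the sampled points; p may be known
    only up to a multiplicative constant.
    """
    log_w = [lp - li for lp, li in zip(log_p, log_i)]
    top = max(log_w)  # subtract the maximum to avoid overflow
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(weights):
    """1 / sum(w_i^2): equals N for equal weights, about 1 if one weight dominates."""
    return 1.0 / sum(w * w for w in weights)


rng = random.Random(0)
# Toy target: standard normal (unnormalized). Importance density: N(0, 0.5^2),
# deliberately too narrow, so points in the tails receive very large weights.
draws = [rng.gauss(0.0, 0.5) for _ in range(10000)]
log_p = [-0.5 * x * x for x in draws]
log_i = [-0.5 * (x / 0.5) ** 2 - math.log(0.5) for x in draws]

w = normalized_weights(log_p, log_i)
largest = max(w)
top10 = sum(sorted(w, reverse=True)[:10])
ess = effective_sample_size(w)
```

Inspecting `largest`, `top10`, and `ess` before using a weighted sample is a cheap check for the kind of mismatch between $p(\theta)$ and $I(\theta)$ that the text describes.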

Our first attempt is to use a multivariate t distribution with
the least squares estimate $\hat\theta = (19.1426, 0.5310)$ as location
parameter and $\Sigma$ from (3.1) as the scale matrix. The sample
of 10000 observations is shown in Figure 6.

Unfortunately, the highest weight is about 0.17 and the sum
of the 10 highest weights is 0.6. Thus, of the 10000 samples,
only 10 really enter into any further analysis. This means
that the Monte Carlo variance for this sample is very high and
that the statistics computed from this sample are essentially
useless. Failure of importance sampling due to dominating
weights results from gross mismatches between the true
posterior $p(\theta)$ and the importance distribution $I(\theta)$. In the
case of the BOD problem, this mismatch becomes apparent if
one compares the contours of the likelihood with the contours
based on the linear approximation. The contours of the
density corresponding to the multivariate t approximation
follow the contours of the linear approximation; the contours
of the true posterior follow the likelihood contours. The
locations of the 10 highest and the 1000 lowest weights are
shown in Figure 7. The sample points with high weights are
exactly in places where there is still considerable posterior
density but the t density is close to zero. This means that
these sample points are likely under the posterior and rare
under $I(\theta)$. The weights have to make up for the difference.
In turn, the weights which are essentially zero correspond
to samples which are likely under the t distribution but rare
under $p(\theta)$.

Figure 6: Importance sample for the BOD parameters
from a direct approximation with a multivariate t density.

Figure 7: Highest and lowest weights for the importance
sample in the original parameters. “x” indicates points
of high weight; the other plotted symbol indicates points of low weight.

Since in nonlinear regression, the likelihood, and
subsequently the posterior, often has strongly non-elliptical
contours, direct importance sampling based on $\hat\theta$ and $\Sigma$ cannot
be recommended. However, if the likelihood profile
transformations are available, the situation is better. Then, one
can conduct the importance sampling in the $\tau$ coordinates
using $\hat\tau$ and the transformed covariance matrix $\Sigma'$, which is
just the correlation matrix in the original parameters. The
likelihood contours in the $\tau$ coordinates usually look much
more elliptical than in the original coordinates and therefore
importance sampling based on multivariate normal or t
distributions will work better there. Importance sampling done
in $\tau$ coordinates helps to eliminate the dominating weights
for the BOD example. Figure 8 shows the resulting sample
points transformed back to $\theta$ coordinates where they trace the
likelihood contours quite well.

Figure 8: Importance sample from an approximation
by a multivariate t density in the $\tau$ parameters
back-transformed to the $\theta$ parameters.

Some caution is needed, however, when doing importance
sampling in $\tau$ coordinates. Using the likelihood in $\tau$
coordinates with a flat prior is not the same as using the likelihood
in the original coordinates under a flat prior. In this case a
Jacobian of the transformation may be necessary.

3.6. The Gibbs Sampler

Rather than sampling from a rough approximation to the
posterior and using weights to bridge the gap, one can attempt
to sample from the posterior directly. Gibbs sampling is an
iterative technique for doing so (Gelfand and Smith, 1990).
However, Gibbs sampling in its usual form is not applicable to
nonlinear regression since the posterior is only known up to a
multiplicative constant and since the conditional distributions
are not given explicitly. Grid based Gibbs sampling (Ritter
and Tanner, 1990) overcomes this difficulty by working with
approximations to the marginal conditionals based on
evaluations of the posterior over one dimensional grids. Figures
9 and 10 show how grid based Gibbs sampling starting with
500 uniformly distributed points quickly recovers the
characteristic features of the BOD likelihood. In this example the
grid based Gibbs sampler was used in its simplest form with
40 equidistant grid points in both the $\theta_1$ and $\theta_2$ directions. The
sample stabilized after only five to ten iterations. A total of
40 iterations were conducted but no further changes could be
observed.

Figure 9: Grid-based Gibbs sampler - starting sample

Figure 10: Grid-based Gibbs sample - after 5 iterations.
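The idea of grid based Gibbs sampling can be sketched as below. This is a simplified illustration, not the algorithm of Ritter and Tanner as published and not the authors' code: each conditional is approximated by a piecewise-constant density on 40 equal cells, which is one of several possible grid approximations. The data are assumed to be the six BOD observations as distributed in R's `BOD` data set, the posterior is taken as $[S(\theta_1,\theta_2)]^{-(n/2-1)}$ on the rectangle used in the text, and all function names are hypothetical.

```python
import math
import random

# Assumed data: the six BOD observations as distributed in R's `BOD` data set.
t = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
y = [8.3, 10.3, 19.0, 16.0, 15.6, 19.8]
n = len(y)
LO1, HI1 = -20.0, 50.0  # domain for theta1
LO2, HI2 = -2.0, 6.0    # domain for theta2


def posterior(theta1, theta2):
    """Unnormalized posterior [S(theta1, theta2)]^(-(n/2 - 1)) on the rectangle."""
    s = sum((yi - theta1 * (1.0 - math.exp(-theta2 * ti))) ** 2
            for ti, yi in zip(t, y))
    return s ** (-(n / 2.0 - 1.0))


def draw_from_grid(density, lo, hi, n_grid, rng):
    """Draw from a piecewise-constant approximation to `density` on [lo, hi]."""
    width = (hi - lo) / n_grid
    mids = [lo + (k + 0.5) * width for k in range(n_grid)]
    weights = [density(m) for m in mids]
    cell = rng.choices(range(n_grid), weights=weights)[0]
    return lo + (cell + rng.random()) * width  # uniform within the chosen cell


def gibbs_sweep(points, n_grid, rng):
    """One Gibbs iteration applied to every point in the sample."""
    out = []
    for th1, th2 in points:
        th1 = draw_from_grid(lambda a: posterior(a, th2), LO1, HI1, n_grid, rng)
        th2 = draw_from_grid(lambda b: posterior(th1, b), LO2, HI2, n_grid, rng)
        out.append((th1, th2))
    return out


rng = random.Random(3)
sample = [(rng.uniform(LO1, HI1), rng.uniform(LO2, HI2)) for _ in range(500)]
for _ in range(5):
    sample = gibbs_sweep(sample, 40, rng)
```

No least squares estimate or covariance matrix is needed, but each sweep costs 500 x 2 x 40 posterior evaluations in this form, which is where the computing effort goes.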

Grid based Gibbs sampling requires little advance information:
for example, it does not require a least squares estimate or a
covariance matrix. However, it is rather computing intensive. In the
above example, the first 10 iterations required 200,000
evaluations of the posterior distribution while for the importance
sampling the posterior was evaluated only 10,000 times. Yet,
the above implementation of the Gibbs sampler was not the
most efficient. For example, by using flexible grids with
fewer points, the amount of computation can be cut easily to
about 100,000.

Gibbs sampling is just one member of a larger class of
sampling algorithms based on Markov chains. Another
member is the Metropolis algorithm which has also been used
successfully for the BOD problem. Since these algorithms
are very new as tools in nonlinear regression, little can be said
about their respective strengths. Further research is needed.


3.7.  Marginal  Inference 

Often, one is interested in individual components of the
parameter vector or in functions of the parameters. Such
inference is notoriously difficult unless one can resort to Monte
Carlo type methods. In this context importance sampling
and Gibbs sampling show their true strengths although they
require a high computing effort. There are other methods for
obtaining approximate marginal distributions (Tierney, Kass,
and Kadane, 1988; Leonard, Hsu, and Tsui, 1989) which
require less computation. Recent work by Leonard, Hsu, and
Ritter reformulates the approximating integrals in a t-type
setting and yields for one-dimensional margins an approximation to

$$p(\theta_j \mid x, y)$$

in which $\tilde{S}(\theta_j)$ is as before and the Hessian is that of the
conditional sum of squares evaluated at the optimum. Note
that in usual applications of Laplace-type approximations, the
exponent of $S$ is $-(n/2 - 1)$. Replacing $n$ by $n - p$ takes
into account that the t-type distribution has “$n - p$ degrees
of freedom”. We shall now demonstrate and compare these
methods.


3.8.  Marginal  Distribution  of  Rate  Parameter 

Using the previously introduced posterior in $\theta_1$ and $\theta_2$, we
can compute the $\theta_2$ marginal by numerical integration or, in
this case, analytic integration. We will restrict ourselves to
the numerical integration since the direct integration is messy
and does not reveal any interesting features. Numerical
integration is easy and fast for integrating out one dimensional
parameters (in this case $\theta_1$), yet it becomes difficult and
computing intensive if the dimension over which the integration
is to be conducted increases. In these situations Laplace-type
approximations and Monte Carlo techniques are preferable.

Figure 11 shows a comparison of the integrated $\theta_2$ marginal
and a marginal histogram derived from the combined Gibbs
sample of iterations 6 through 10. The match is very good.

Figure 12 shows the corresponding picture for a histogram
derived from the importance sample in the original
coordinates. Clearly, there are too few points with large $\theta_2$
component and consequently, the corresponding weights are
very high. The spikes in the right part of the picture are
caused by single observations with high weights.

Figure 13 shows the histogram for the importance sample
created using the profile transforms and a Jacobian. This
importance sample performs much better than does the direct
one, yet still not as well as the Gibbs sample.

Figure 11: Marginal density for $\theta_2$. The dotted line is
from numerical integration and the bars are from pooling
iterations 6-10 of the Gibbs sampler.

Figure 12: Marginal density for $\theta_2$. The dotted line is
from numerical integration and the bars are from the
importance sample in the $\theta$ parameters.

Finally, Figure 14 shows a comparison of the integrated
marginal and the marginal obtained using t-type approximate
marginalization. To make the latter marginal comparable with
the integrated one, we set the marginal equal to zero for the
points where the conditional minimum of the sum of squares
fell outside the domain for $\theta_1$ and $\theta_2$ and we normalized the
resulting curve to integrate to unity over the $\theta_2$ domain.


Approaches  to  Inference  for  Nonlinear  Models  155 


Figure 13: Marginal density for $\theta_2$. The dotted line is
from numerical integration and the bars are from the
importance sample in the $\tau$ parameters.

The shape of the curves matches perfectly for much of
the range. As, however, $\theta_2$ approaches zero, the conditional
minimum of $S$ moves outside of the domain. Since the
conditional maximization is being done routinely in the profiling
algorithm, the approximate margins can be obtained as an
inexpensive by-product of profiling.

4.  Conclusions 

There is still much to be done in comparing approaches to
inference, even for the specific case of the nonlinear regression
model. The methods we described based on profiling or on
Monte Carlo approaches are feasible for small- to medium-sized
problems. They show that it is possible routinely to go
beyond quoting “asymptotic” standard errors and correlations
for parameters. Obtaining an importance sample is relatively
straightforward but we would recommend always using the
profile-based transformations before obtaining the importance
sample. Without transforming to more stable parameters, the
Monte Carlo efficiency of the importance sample can be
much too low. The Gibbs sampler, and other methods based
on Markov chains like the Metropolis algorithm, are very
robust but also very expensive. We found that we did
have to pay careful attention to the prior distribution of the
parameters when using such methods. This may be because
of pathologies in the small example we were using but we feel
it is to some extent an inherent property of the methods. Luke
Tierney, in his discussion of this paper, had several comments
to make about reasonable choices of a prior.

The need to consider the choice of prior carefully is a “good
news/bad news” type of situation. The good news is that you
are forced to look at your model and data carefully and hence
create a more informed analysis. The bad news is that “black
box”-style automation of the methods becomes much more
difficult.

Figure 14: Marginal density for $\theta_2$. The dotted line
is from numerical integration and the other plotted values are from the
Laplace t approximation.

Abstract

Markov chain Monte Carlo (e.g., the Metropolis algorithm
and Gibbs sampler) is a general tool for simulation
of complex stochastic processes useful in many types of
statistical inference. The basics of Markov chain Monte
Carlo are reviewed, including choice of algorithms and
variance estimation, and some new methods are introduced.
The use of Markov chain Monte Carlo for maximum
likelihood estimation is explained, and its performance
is compared with maximum pseudo likelihood
estimation.


Key Words: Markov chain, Monte Carlo, Maximum
likelihood, Metropolis algorithm, Gibbs sampler,
Variance estimation.

1  Introduction 

For many complex stochastic processes very little can be
accomplished by analytic calculations, but simulation of
the process is possible using Markov chain Monte Carlo
(Metropolis et al., 1953; Hastings, 1970; Geman and
Geman, 1984). The simulation can be used to calculate
integrals involved in various forms of statistical
inference. Most work in this area has concentrated on
Bayesian inference (Geman and Geman, 1984; Gelfand
and Smith, 1990; Besag, York, and Mollie, 1991). But
Markov chain Monte Carlo is a general tool for simulation
of stochastic processes; it should be useful, and has
been applied, in other forms of inference.

One such area is likelihood inference. For complex
stochastic processes such as the Markov random fields
(Gibbs distributions) used in spatial statistics (and other
areas, with Markov random fields defined on graphs,
networks, pedigrees, and the like) exact calculation of
the maximum likelihood estimate (MLE) is impossible,
but several methods of Monte Carlo approximation of
the MLE have been devised. One uses direct Monte
Carlo calculation of the likelihood (Penttinen, 1981;
Geyer, 1990; Geyer and Thompson, 1992). Another
uses stochastic approximation (Younes, 1988; Moyeed
and Baddeley, 1991). A third is that of Ogata and
Tanemura (1989). Only the first of these permits the
computation of many estimates from one Monte Carlo
sample and so permits rapid parametric bootstrap
computations and simulation studies. These are important
ways of studying the properties of the estimators, and
the other methods will not be further discussed. Coding
and maximum pseudolikelihood estimates (MPLE)
(Besag, 1974, 1975) have also been used for such
problems, but these estimators do not approximate the MLE,
except in the limit of zero dependence.

Monte Carlo maximum likelihood is illustrated using
the two-parameter Ising model as an example. This
model is simple enough so that extensive simulations are
possible but has most of the complexity of more
elaborate models, in particular, the behavior of “freezing,”
which presents severe problems for maximum
pseudolikelihood, but none for maximum likelihood. MLE is
compared to MPLE in a case where the random field
has strong dependence (is near freezing) where the
superiority of MLE over MPLE is clearly shown.

2  Markov  Chain  Monte  Carlo 

Before discussing the use of Markov chain Monte Carlo
for maximum likelihood, it is first necessary to briefly
review these Markov chain methods, since the literature
is confused and contains some bad advice.

Markov chain Monte Carlo is an old method of simulation
that goes back to the dawn of the computer age,
but which has had, until recently, little application in
statistics. The main idea is very simple. In ordinary
Monte Carlo, if one wishes to evaluate an integral

$$P g = \int g \, dP \tag{1}$$

where $P$ is a probability measure and one has a method
of simulating a sequence $X_1$, $X_2$, $\ldots$ of i.i.d. realizations

Markov  Chain  Monte  Carlo  Maximum  Likelihood  157 


from $P$, the obvious estimate is

$$P_n g = \frac{1}{n} \sum_{i=1}^{n} g(X_i) \tag{2}$$

since

$$P_n g \xrightarrow{\text{a.s.}} P g \tag{3}$$

by the strong law of large numbers whenever $g$ is
$P$-integrable. The notation in (1) and (2) is standard in
the empirical process literature and very convenient; (1)
treats the symbol $P$ interchangeably as a measure and
as an operator, (2) treats the empirical measure $P_n$ (the
measure-valued stochastic process that puts mass $1/n$
at each of the points $X_i$ in the sample) the same way.
Though ordinary Monte Carlo is very powerful, it has
its limitations. In particular there are no general methods
for simulating independent realizations of multivariate
random vectors or, more generally, from complex
stochastic processes. This difficulty is circumvented by
Markov chain Monte Carlo in which one simulates not
independent realizations from $P$ but a Markov chain $X_1$,
$X_2$, $\ldots$ with stationary transition probabilities having $P$
as a stationary distribution. If the chain is irreducible,
(3) still holds, though it is now referred to as the ergodic
theorem rather than the strong law of large numbers.
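The contrast between estimate (2) for independent draws and the same average taken along a Markov chain can be illustrated with a toy example. The chain below is an assumption made for illustration only and does not come from the paper: a Gaussian autoregression whose stationary distribution is standard normal, started deliberately far from equilibrium. The function names are hypothetical.

```python
import math
import random


def ar1_chain(n, rho, x0, rng):
    """Markov chain X_{i+1} = rho * X_i + sqrt(1 - rho^2) * Z_i.

    Its stationary distribution P is N(0, 1) for any |rho| < 1.
    """
    scale = math.sqrt(1.0 - rho * rho)
    x, out = x0, []
    for _ in range(n):
        x = rho * x + scale * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def empirical_mean(sample, g):
    """P_n g: the average of g over the sample, as in (2)."""
    return sum(g(x) for x in sample) / len(sample)


def g(x):
    """Test function with P g = 1 under N(0, 1)."""
    return x * x


rng = random.Random(1)
n = 100_000
iid = [rng.gauss(0.0, 1.0) for _ in range(n)]     # ordinary Monte Carlo
chain = ar1_chain(n, rho=0.9, x0=10.0, rng=rng)   # dependent, badly started

est_iid = empirical_mean(iid, g)
est_chain = empirical_mean(chain, g)
```

Both averages settle near $Pg = 1$ even though no element of the chain is an exact draw from $P$; the chain average is simply noisier for the same $n$ because of the dependence.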

Since a countable union of null sets is a null set, (3)
can be taken to hold simultaneously (for the same null
set of sample paths of the Markov chain) for all functions
$g$ in any countable family. If the state space of the
Markov chain (the sample space of the measure $P$) is a
second countable topological space (such as $\mathbb{R}^d$) and the
countable family of functions is taken to be indicators
of open sets in the countable base, then, for almost all
sample paths of the Markov chain,

$$P_n 1_B \to P 1_B, \quad \text{for all open sets } B,$$

that is

$$P_n \xrightarrow{\mathcal{D}} P \tag{4}$$

(the empirical converges in distribution to the truth).

This is the sense in which Markov chain Monte Carlo
“works.” The samples $X_1$, $X_2$, $\ldots$ are neither independent
nor identically distributed, and none has marginal
distribution $P$ (though typically the marginal distribution
of $X_n$ is close to $P$ for large $n$). They behave like
samples from $P$, however, in the sense that (4) holds,
just as if $X_1$, $X_2$, $\ldots$ were i.i.d. $P$.

Some confusion in the literature has resulted from
failure to understand this basic nature of Markov chain
Monte Carlo. One sees described without justification
in various places the following way to do Markov chain
Monte Carlo. Let $X_{11}$, $\ldots$, $X_{m1}$ be independent
realizations from some distribution. For $j = 1, \ldots, m$,
simulate $X_{j2}$, $\ldots$, $X_{jn}$, a Markov chain starting at $X_{j1}$, all
$m$ chains having the same transition probabilities and
stationary distribution $P$. Take

$$\frac{1}{m} \sum_{j=1}^{m} g(X_{jn}) \tag{5}$$

as an estimate of $\int g \, dP$. This formula, which may be
referred to as the “many short runs” school of Markov
chain Monte Carlo (as opposed to the “one long run”
school) has some problems. As $m \to \infty$, (5) converges to
something by the strong law of large numbers; it does
not, however, converge to $\int g \, dP$. That would require
that both $m$ and $n$ go to infinity. One can, of course,
collect multiple samples in each short run, and this does
ameliorate the problem but relies on the “short” runs
actually being “long.” The closer many short runs is
made to one long run, the better it is. This was well
understood in the statistical physics literature and in some
of the early statistics literature, but needs reiteration.
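The failure of (5) to converge to $\int g \, dP$ for fixed $n$ can be seen in a toy calculation. Everything below is an illustrative assumption rather than material from the paper: the chain is a Gaussian autoregression with stationary distribution N(0, 1), every short run starts from the same point (a degenerate starting distribution), and $g(x) = x^2$, so that $\int g \, dP = 1$ while the expectation of $g(X_{jn})$ is available in closed form.

```python
import math
import random


def end_of_short_run(x_start, steps, rho, rng):
    """Run the chain X <- rho * X + sqrt(1 - rho^2) * Z and return the last state."""
    scale = math.sqrt(1.0 - rho * rho)
    x = x_start
    for _ in range(steps):
        x = rho * x + scale * rng.gauss(0.0, 1.0)
    return x


def g(x):
    """Test function with integral 1 under the stationary N(0, 1) law."""
    return x * x


rng = random.Random(2)
rho, n, m = 0.9, 10, 20_000
x_start = 5.0  # every run starts here: X_j1 = 5 for all j

# Estimate (5): average g over the final states X_jn of m independent short runs.
ends = [end_of_short_run(x_start, n - 1, rho, rng) for _ in range(m)]
many_short_runs = sum(g(x) for x in ends) / m

# Limit of (5) as m -> infinity with n fixed: E g(X_jn), not the integral of g.
decay = rho ** (2 * (n - 1))
limit_m = decay * x_start ** 2 + (1.0 - decay)
```

Increasing `m` only tightens the estimate around `limit_m`, which here is far from 1; the discrepancy shrinks only as `n` grows, that is, as each short run becomes long.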

This  is  not  a  purely  theoretical  point;  many  short  runs 
also  has  practical  drawbacks.  To  see  these  we  need  some 
discussion  of  the  practice  of  Markov  chain  Monte  Carlo. 
Typically  a  chain  is  run  for  a  while  to  “forget”  its  start¬ 
ing  point  before  samples  are  collected;  then  the  chain  is 
subsampled,  a  sample  being  taken  every  kth  step.  The 
number  of  samples  in  thrown  away  at  the  beginning  of 
the  chain  will  be  termed  the  “burn-in”  (there  is  no  stan¬ 
dard  terminology),  and  k  will  be  termed  the  “spacing.” 
The  empirical  estimate  for  such  a  subsample  is  defined 
by 

1  " 

Pn  3  =  -  Vit(A’„,+t  ,),  (6) 

n  ' 

1=1 

rather  than  (2).  Of  course  the  subsample  is  again  a 
Markov  chain  with  stationary  transition  probabilities, 
and  (3)  still  holds.  The  reasons  for  choosing  any  in 
other  than  zero  and  any  k  other  than  one  have  not  been 
made  clear.  The  spacing  k  is  often  chosen  to  be  large 
in  order  that  the  samples  X-m+k  i  be  “almost  indepen¬ 
dent”  as  if  reliance  were  being  placed  on  some  hypo¬ 
thetical  “almost”  law  of  large  numbers  rather  than  the 
ergodic  theorem.  Simple  variance  calculations,  which 
will  be  explained  below,  show  that  in  many  cases  k  =  1 
is  optimal  and  in  almost  all  cases  the  optimal  k  is  less 
than five. The role of the burn-in m is also not well
understood. It is often thought that m must be chosen
large enough so that X_m almost has marginal distribution
P, something that typically cannot be checked.
This leads to using very large m for “safety.” If the one
long run method is being used, a fairly large burn-in,
say five per cent of the total run length, is not excessive
and will usually be more than adequate. In any
case, the accuracy of the method is relatively insensitive
to the burn-in. Even inadequate burn-in will have
only a small effect on the results. The many short runs
method  perversely  arranges  the  calculation  so  that  not 
only  does  burn-in  dominate  the  cost  of  the  calculation 
(the  method  is  really  only  valid  as  the  burn-in  becomes 
infinite),  but  also  the  accuracy  critically  depends  on  the 
adequacy  of  burn-in,  which  is  uncheckable.  The  many 
short  runs  method  arranges  to  have  many  burn-ins  at 
much  cost  and  to  no  benefit. 
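
As a small illustration of the estimate (6), the sketch below averages a function g over a stored chain after discarding a burn-in m and keeping every kth state. The function name `subsampled_estimate` and the convention that `chain[0]` holds X_1 are assumptions made for this example only.

```python
def subsampled_estimate(chain, g, m=0, k=1):
    """Average g over X_{m+k}, X_{m+2k}, ... as in (6).

    chain[0] is taken to hold X_1, so X_{m+ki} sits at index m + k*i - 1.
    """
    values = [g(chain[j]) for j in range(m + k - 1, len(chain), k)]
    if not values:
        raise ValueError("chain too short for this burn-in and spacing")
    return sum(values) / len(values)
```

With m = 0 and k = 1 this reduces to the plain average over the whole run.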

At  this  point  many  people  remark  that  even  if  one 
is  willing  to  concede  the  point  just  made,  multiple  runs 
have  some  diagnostic  value,  at  least.  This  is,  of  course, 
correct.  It  is  clear  that  if  two  runs  produce  completely 
different answers, the runs are too short. But this diagnostic
value is a “one-edged” sword. It is not valid
to draw any comfort from the agreement of short runs,
even many short runs. Counterexamples exist that prove
such hopes illusory. The best diagnostic is a very long
run, which will find places in the state space that one
never thinks to start.

With  these  general  comments  out  of  the  way,  we  now 
turn  to  specific  algorithms.  The  first  Markov  chain 
Monte  Carlo  method  was  given  by  Metropolis  et  al. 
(1953) and is generally known as the “Metropolis algorithm.”
This algorithm received wide use in the statistical
physics community from the beginning, but has,
even  today,  had  little  use  in  the  statistics  community. 

Suppose the desired stationary distribution has a density
p with respect to some measure μ. The algorithm
employs an auxiliary function q(y, x) such that q(·, x)
is a probability density with respect to μ for each x and
q(x, y) = q(y, x) for all x and y. The Markov chain is
generated by repeatedly applying the following update
step.

1. simulate y from the distribution with density q(·, x)

2. calculate the odds ratio r = p(y)/p(x)

3. if r ≥ 1 go to y

4. if r < 1 go to y with probability r, else stay at x

Simple  calculations  show  that  the  Metropolis  algorithm 
has the desired distribution with density p as one stationary
distribution (see, for example, Ripley, 1987). If
the chain can be shown to be irreducible (which depends
on the specific structure of p and q), it is ergodic and
can  be  used  for  Monte  Carlo. 
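
The update step can be sketched for a finite state space as follows. This is only an illustration: states are the integers 0, ..., n-1, `p` is a list of unnormalized weights, and `q[x]` is the list of proposal probabilities from state x (assumed symmetric). The names `metropolis_step` and `metropolis_matrix` are hypothetical. The second function builds the exact transition matrix, so that stationarity of p can be checked directly instead of by simulation.

```python
import random

def metropolis_step(x, p, q, rng=random):
    """One Metropolis update from state x (p[x] must be positive)."""
    y = rng.choices(range(len(p)), weights=q[x])[0]  # step 1: propose y
    r = p[y] / p[x]                                  # step 2: odds ratio
    if r >= 1 or rng.random() < r:                   # steps 3 and 4
        return y
    return x

def metropolis_matrix(p, q):
    """Exact transition matrix of the update above for symmetric q."""
    n = len(p)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                P[x][y] = q[x][y] * min(1.0, p[y] / p[x])
        P[x][x] = 1.0 - sum(P[x])  # rejected proposals stay at x
    return P
```

Multiplying the normalized p by this matrix returns p again, which is the stationarity property claimed in the text.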

One  problem  with  the  Metropolis  algorithm  is  the  re¬ 
quirement  that  q  be  symmetric.  Hastings’  (1970)  al¬ 
gorithm  drops  this  requirement.  In  order  to  maintain 
the correct stationary distribution, this requires that in
step 2 of the Metropolis update, r be redefined as

    r = \frac{p(y)\, q(x, y)}{p(x)\, q(y, x)}

(so it can no longer be called an “odds ratio”). The
algorithm  works  just  as  well  with  this  modification.  The 
Hastings  algorithm  allows  an  essentially  arbitrary  choice 
of  “candidate”  points. 
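
To see that the modified ratio does preserve p for a non-symmetric q, one can again build the exact transition matrix on a small finite state space. In this sketch `q(y, x)` is a callable giving the probability of proposing y from x, mirroring the notation of the text; the name `hastings_matrix` and the finite-state setting are assumptions for illustration.

```python
def hastings_matrix(p, q):
    """Exact transition matrix of the Hastings update.

    p: list of unnormalized target weights on states 0..n-1.
    q(y, x): probability of proposing y when the chain is at x.
    """
    n = len(p)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x and q(y, x) > 0:
                r = (p[y] * q(x, y)) / (p[x] * q(y, x))
                P[x][y] = q(y, x) * min(1.0, r)
        P[x][x] = 1.0 - sum(P[x])  # rejected proposals stay at x
    return P
```

The product of the normalized p with this matrix is p itself, whatever positive proposal table is supplied.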

A more recent algorithm is the Gibbs sampler (Geman
and Geman, 1984). This algorithm is applicable
only when the state variable is a random vector
x = (x_1, ..., x_p); it does not apply to arbitrary state
spaces. At each step one variable, say x_i, is changed by
giving it a realization from the conditional distribution
of x_i given the rest of the variables under the stationary
distribution.

Though this looks very different from the Metropolis
and Hastings algorithms, it is almost a special case of the
Hastings algorithm in which the one-dimensional conditional
distributions play the role of the auxiliary function q.
The analogy with Hastings does suggest that when one
cannot sample exactly from the one-dimensional conditionals,
one can do a Hastings-like rejection to correct
inexact  sampling,  as  long  as  one  does  know  the  density 
one  is  sampling  from.  For  more  on  this  subject  see  Besag 
(this  volume). 
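
A Gibbs update is easy to sketch when the state is a short tuple of discrete variables and the joint distribution is available as a table. The representation (`joint` as a dict from state tuples to unnormalized probabilities) and the names `gibbs_update` and `gibbs_scan` are assumptions for this example.

```python
import random

def gibbs_update(x, i, joint, values, rng=random):
    """Resample coordinate i of the tuple x from its conditional
    distribution given the other coordinates, under `joint`."""
    candidates = [x[:i] + (v,) + x[i + 1:] for v in values]
    weights = [joint[c] for c in candidates]  # proportional to the conditional
    return rng.choices(candidates, weights=weights)[0]

def gibbs_scan(x, joint, values, rng=random):
    """Update every coordinate once, in a fixed order."""
    for i in range(len(x)):
        x = gibbs_update(x, i, joint, values, rng)
    return x
```

Only ratios of `joint` values enter the conditional, so the table need not be normalized.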

3  New  Methods 

All of the literature on Markov chain Monte Carlo describes
using chains with all Metropolis update steps
(a Metropolis algorithm) or pure Gibbs steps (a Gibbs
sampler), although there is no reason for this. Any steps
that preserve the stationary distribution can be mixed
in any order. To make a chain with stationary transition
probabilities, it is necessary that a fixed sequence
of steps (called a “scan”) be repeated over and over and
that samples be collected only after complete scans or
multiples of complete scans. This is typical for the Gibbs
sampler, a scan consisting of updating each x_i, running
through the variables in some fixed order. But much
more general scans are possible. There is no reason not
to mix Gibbs, Metropolis, and Hastings steps in a single
chain,  or  for  that  matter,  other  update  steps  yet  to  be 
invented.  Large  increases  in  speed  can  be  obtained  by 
clever  choices  of  update  steps. 

A  simple  example  is  to  attempt  to  make  a  variety  of 
steps of various sizes. When the distribution of interest
has two (or more) modes, it is important to make
attempts to jump from one mode to the other, if at all
possible. This will be illustrated below in the discussion
of the Ising model, where the modes are roughly symmetrically
distributed in the sample space and hence
easy to identify, and one can jump between modes via a
“symmetry swap,” changing the sign of all variables at
once. Metropolis rejection of the swap steps preserves
the  desired  stationary  distribution. 

It is not always possible to find steps that jump between
modes, or even to find out (apart from Monte
Carlo experiments) how many modes there are. What
is needed is some way to make large steps without explicit
detailed knowledge about the distribution of interest.
A device which we are calling Metropolis-coupled
Markov chain Monte Carlo, (MC)^3 for short, provides
a way to do this (Geyer, 1991b). Suppose we run m
Markov chains in parallel, having different, but related,
equilibrium distributions, P_1, ..., P_m. For example, if
the distribution of interest is a Gibbs distribution with
density proportional to e^{-U(x)/t}, U(x) being the potential
function and t the temperature, we could take P_k to
have density proportional to [expression illegible in the source]. After each scan
(in which all of the chains attempt one step for each
variable) we attempt to swap the states of two of the
chains. This is a Metropolis update since swapping is
symmetric, so the swap of chains i and j is accepted or
rejected according to the odds ratio

    r = \frac{P_i(x_j)\, P_j(x_i)}{P_i(x_i)\, P_j(x_j)}.    (7)

The  coupling  induces  dependence  among  the  chains,  and 
they  are  no  longer  (by  themselves)  Markov.  The  whole 
stochastic process (the m chains together) does form
a Markov chain on the m-fold cartesian product of the
original state space. Since (7) is the odds ratio assuming
independence of the distributions for the chains, the stationary
distribution of the whole process is the product
of the P_i. The chains are asymptotically independent
with  the  desired  stationary  distributions. 
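
The swap step can be sketched on the log scale, where the unknown normalizing constants of the chains cancel in (7). The names `swap_log_odds` and `attempt_swap` are hypothetical, and each entry of `log_hs` is assumed to be a function returning the log of the unnormalized density of one chain.

```python
import math
import random

def swap_log_odds(log_h_i, log_h_j, x_i, x_j):
    """Log of the odds ratio (7) for swapping the states of chains i and j."""
    return (log_h_i(x_j) + log_h_j(x_i)) - (log_h_i(x_i) + log_h_j(x_j))

def attempt_swap(states, log_hs, i, j, rng=random):
    """Metropolis-accept a swap of states[i] and states[j] in place."""
    log_r = swap_log_odds(log_hs[i], log_hs[j], states[i], states[j])
    if log_r >= 0 or rng.random() < math.exp(log_r):
        states[i], states[j] = states[j], states[i]
        return True
    return False
```

If two chains share the same distribution the log odds are zero and every swap is accepted. As one possible use (an assumption, not necessarily the family used in the paper), chains at temperatures t_k with log density -U(x)/t_k fit this interface.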

If  the  coupling  does  not  change  the  stationary  distri¬ 
butions,  what  is  the  point?  It  may  make  all  of  the  chains 
mix  much  faster,  faster  than  any  one  of  them  uncoupled. 
This  effect  is  due  to  the  chains  having  different  distribu¬ 
tions.  It  is  clear  that  if  the  distributions  are  the  same, 
every  swap  is  accepted  and  the  chains  produce  the  same 
realizations  with  or  without  swapping.  If  one  untan¬ 
gles  the  swapped  chains  (following  one  state  as  it  jumps 
back  and  forth  among  the  distributions),  one  gets  a  dif¬ 
ferent  process.  Now,  by  symmetry,  all  of  the  untangled 
chains  have  the  same  marginal  distribution,  though  they 
are  no  longer  even  asymptotically  independent,  and  this 
marginal  distribution  must  be  the  equal  mixture  of  the 
distributions P_i. This says that in some sense the speed
of the chains is that of a mixture of the update steps for
the separate chains. This mixture may run faster than
any  of  the  pure  chains. 

Examples of these devices will be given later after the
Ising model is described. For now, let us close this section
with the point that if one is worried that the Gibbs
sampler, or whatever Markov chain scheme one is using,
mixes too slowly, one should try to speed it up. There
are many possible tricks for doing so. These are examples
of what is possible.

4  Variance  Calculations 

Given the consistency (3) of Markov chain Monte Carlo,
the natural next question is to examine the error
√n (P_n g − P g). Typically one would like there to be
a central limit theorem

    \sqrt{n}\,(P_n g - P g) \xrightarrow{\mathcal{D}} N(0, \sigma^2)    (8)

(note that σ² depends on g). When the state space of the
Markov chain Monte Carlo is finite, the central limit theorem
(8) always holds (see, for example, Chung, 1967,
p. 99 ff., or Ibragimov and Linnik, 1971, pp. 365-369).
There  are  Markov  chain  central  limit  theorems  for  non- 
finite  state  spaces,  but  the  regularity  conditions  seem 
difficult  to  apply  (this  is  a  subject  of  active  research  by 
a  number  of  investigators). 

Markov chain limit theory is of use only in demonstrating
that (8) holds with σ² finite; it does not yield
the value of σ², which must be estimated from the
Markov chain. This is easily done using standard time-series
methods. Hastings (1970) gave references to methods
then current; only slight changes are needed to bring
these recommendations up to date. In cases of practical
interest σ² will have the form

    \sigma^2 = \sum_{t=-\infty}^{\infty} \gamma_t    (9)

where

    \gamma_t = \gamma_{-t} = E\bigl( g(X_0)\, g(X_t) \bigr),

the expectation being with respect to the stationary distribution.
The γ_t are easily estimated by

    \hat\gamma_t = \hat\gamma_{-t} = \frac{1}{n} \sum_{i=1}^{n-t} g(X_i)\, g(X_{i+t}).

For why we divide by n rather than n − t see Priestley
(1981, pp. 323-324). One might think that the sum of
the γ̂_t would be a natural estimator of σ², but this is a
bad idea for the following reason. For large t the variance
of γ̂_t is approximately constant

    \operatorname{Var}(\hat\gamma_t) \approx \frac{1}{n} \sum_{s=-\infty}^{\infty} \gamma_s^2    (10)

(Bartlett, 1946); the right hand side of (10) does not depend
on t. This assumes that g(X_i) has a fourth moment and
that some mixing condition holds (ρ-mixing suffices).
Thus the sum of the γ̂_t differs from (9) by n terms of
size 1/n. It does not decrease with n; the estimate is
not even consistent. In order to get a good estimate it is
necessary to downweight the terms for large lags, which
are essentially noise. One estimates σ² by

    \hat\sigma^2 = \sum_{t} w(t)\, \hat\gamma_t    (11)

where w is some weight function that satisfies w(t) = 1
for small t, w(t) = 0 for large t, and makes a smooth
monotone transition between these levels.

The right hand side of (10) is useful in choosing w.
One can take w(t) = 1 for t such that γ̂_t exceeds two
“large t” standard deviations. Since it is usually impossible
to arrange a chain with significant negative autocorrelations,
one can take w(t) = 0 when γ̂_t < 0 and
for all larger t. Any smooth curve connecting these two
points is satisfactory. We use a scaled cosine.
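
One way this recipe might be coded is sketched below. Several details are assumptions not fixed by the text: the values g(X_i) are centered at their sample mean before forming the autocovariances, the “large t” standard deviation is obtained by plugging the estimated autocovariances up to `max_lag` into (10), and the cosine taper runs from the last lag with weight one to the first lag with a negative autocovariance. All function names are hypothetical.

```python
import math

def autocovariances(values, max_lag):
    """Estimates of gamma_0..gamma_max_lag, dividing by n at every lag."""
    n = len(values)
    mean = sum(values) / n
    c = [v - mean for v in values]  # centering is an assumption
    return [sum(c[i] * c[i + t] for i in range(n - t)) / n
            for t in range(max_lag + 1)]

def lag_window(gammas, n):
    """Weights w(0..max_lag): 1 at small lags, 0 from the first negative
    autocovariance on, scaled cosine in between."""
    sd = math.sqrt((gammas[0] ** 2 + 2 * sum(g * g for g in gammas[1:])) / n)
    t_lo = 0
    while t_lo + 1 < len(gammas) and gammas[t_lo + 1] > 2 * sd:
        t_lo += 1
    t_hi = next((t for t in range(t_lo + 1, len(gammas)) if gammas[t] < 0),
                len(gammas))
    weights = []
    for t in range(len(gammas)):
        if t <= t_lo:
            weights.append(1.0)
        elif t >= t_hi:
            weights.append(0.0)
        else:
            weights.append(0.5 * (1 + math.cos(math.pi * (t - t_lo) / (t_hi - t_lo))))
    return weights

def variance_estimate(values, max_lag):
    """Windowed sum of autocovariances, using gamma_t = gamma_{-t}."""
    g = autocovariances(values, max_lag)
    w = lag_window(g, len(values))
    return w[0] * g[0] + 2 * sum(wt * gt for wt, gt in zip(w[1:], g[1:]))
```

Dividing the result by the number of samples gives an approximate variance for the Monte Carlo average.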

Before leaving this subject, the frequency domain version
of the same procedure should perhaps be explained,
since one may see this described instead and the equivalence
of the two methods is not obvious. (9) is 2π times
the value of the spectral density at the origin (of the time
series g(X_i)). To estimate the spectral density one may
use a kernel smoother with kernel u on the empirical
spectral estimate, which is the Fourier transform of the
γ̂_t. If one uses the Fourier transform of w for the smoothing
kernel u, one obtains exactly the same estimate as
(11). In the usual time-series parlance w is called a lag
window and u a spectral window.

5  Choosing  the  Spacing 

Having a method of estimating variances gives us a
method of measuring the “speed” of a Markov chain
scheme. A chain is rapidly mixing if the autocorrelations
decrease rapidly enough so that the variance of our
estimate(s) of interest is small. This is a relative term;
we can only say that one chain mixes more rapidly than
another; there is no absolute standard.

One obvious comparison is between chains that are
alike except for different spacing. Suppose that the chain
is ρ-mixing (always true if the state space is finite) so
the γ_t decrease exponentially fast; then the asymptotic
variance for a chain with spacing k will be

    [equation for s_k illegible in the source]

for some constants A > 0 and 0 < ρ < 1. Clearly
as k → ∞ the variance s_k converges to the marginal
variance γ_0 that would be obtained if one could do independent
sample Monte Carlo. Since the convergence
is exponentially fast, there is little benefit to large spacings.
To see this more clearly, let B be the cost of sampling
(typically computer time), and let C be the cost
of “using” a sample. If the samples cost almost nothing
to use, one may take C = 0. If one uses n samples with
spacing k, the cost is Bnk + Cn, because the chain runs
for nk steps and n samples are used. The variance of
the estimate is approximately s_k/n. Hence to get a fixed
accuracy one must have n proportional to s_k. Thus the
cost for spacing k is proportional to (Bk + C) s_k. For
large k this increases linearly in k. The minimum cost
will be attained for some small value of k, the optimal
spacing. Note that if C = 0 the optimal spacing is
greater than one only if s_1 > 2 s_2, which is typically not
the case. One needs some cost of using samples (cost of
calculating estimates, cost of storing samples, plotting
samples, or whatever) to make subsampling a good idea.
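
The cost argument can be tried numerically under an assumed autocovariance model. The sketch below takes γ_t = γ_0 ρ^|t| (a simplifying assumption of this example, not the expression used in the paper), so that the asymptotic variance at spacing k is γ_0 (1 + ρ^k)/(1 − ρ^k), and writes B for the sampling cost per step and C for the cost of using a sample. The function names are hypothetical.

```python
def asymptotic_variance(k, gamma0=1.0, rho=0.9):
    """s_k = sum over t of gamma_{kt}, assuming gamma_t = gamma0 * rho**|t|."""
    r = rho ** k
    return gamma0 * (1 + r) / (1 - r)

def cost(k, B=1.0, C=0.0, gamma0=1.0, rho=0.9):
    """Quantity proportional to the cost of reaching a fixed accuracy."""
    return (B * k + C) * asymptotic_variance(k, gamma0, rho)

def optimal_spacing(B=1.0, C=0.0, gamma0=1.0, rho=0.9, max_k=200):
    """Spacing in 1..max_k with the smallest cost."""
    return min(range(1, max_k + 1), key=lambda k: cost(k, B, C, gamma0, rho))
```

Under this model with C = 0 the minimum is always at k = 1, since s_1 > 2 s_2 would require (1 − ρ)² < 0; a spacing larger than one pays off only when C is appreciable relative to B.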

If one is interested in calculating integrals of many
functions g, there is no one spacing that is optimal for
all, nor would one want to do variance calculations for
all. Fortunately, this is not necessary. Typically the
cost curves will be U-shaped with a broad bottom and
the curves for a representative sample of functions will
have minima in roughly the same place. We do not recommend
elaborate variance calculations accompanying
every Markov chain Monte Carlo estimate, but there is
no substitute for some variance calculations for comparing
methods, for selecting spacings, and just generally
getting a feel for how well a scheme works.

6  The  Ising  Model 

The model employed for our example is a standard two-parameter
Ising model on a 32 × 32 square lattice with
periodic boundary conditions. Let x_i denote the random
variable at lattice site i, which takes values in {−1, 1},
and x = (x_i) denote the whole random field. Let i ~ j
denote that sites i and j are nearest neighbors. Every
site has four neighbors, since the lattice is considered
a torus. The statistical model is a two-parameter exponential
family with natural statistics

    t_1(x) = \sum_i x_i    and    t_2(x) = \sum_{i \sim j} x_i x_j.

For concreteness we will call the lattice sites with x_i = 1
“white pixels” and the rest “black pixels,” following the
language of image processing. Then t_1 is the excess of
white over black pixels, and t_2 is the excess of concordant
nearest neighbor pairs over discordant pairs.

The probability of a point x in the sample space is

    p_\theta(x) = \frac{1}{z(\theta)} e^{\langle t(x), \theta \rangle}

where ⟨t, θ⟩ = t_1 θ_1 + t_2 θ_2 and

    z(\theta) = \sum_{x \in S} e^{\langle t(x), \theta \rangle}.    (12)

The parameters θ_1 and θ_2 are referred to here as the
“level” parameter and “dependence” parameter respectively.
We also use the notation α = θ_1 and β = θ_2.

At β = 0, the pixels are independent; for large β the
distribution has two modes: almost all of the pixels are
the same color with just a speckle of the other. The
proportion of realizations that are predominantly white
or black depends on α; when α = 0, the modes are
equally probable. This behavior occurs for all lattice
sizes, even for an infinite lattice, where the transition
from patches of both colors to (almost) all one color
occurs sharply at the critical value (1/2) sinh^{-1}(1) = 0.4407.
The transition is not sharp for finite lattice sizes, but
occurs in roughly the same place.

For any lattice site i, let x_{-i} denote the rest of the
variables besides x_i. The conditional distribution of x_i
given x_{-i} plays an important role in both likelihood and
pseudolikelihood methods. This conditional distribution
is denoted p_θ(x_i | x_{-i}). Let n_i = Σ_{j ~ i} x_j denote the sum
of the nearest neighbors of lattice site i. Then

    \operatorname{logit} p_\theta(x_i = 1 \mid x_{-i}) = \operatorname{logit} p_\theta(x_i = 1 \mid n_i)
                                                     = 2(\theta_1 + \theta_2 n_i).    (13)

The first equality, that the distribution of x_i given the
rest depends only on its neighbors, is called the spatial
Markov property. It simplifies calculations, but otherwise
plays no role in the analysis.
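
The conditional probability in (13) is cheap to compute, which is what makes single-site updates practical. A minimal sketch follows, assuming the field is stored as a list of rows of +1 and −1 values; the names `neighbor_sum` and `prob_white` are hypothetical.

```python
import math

def neighbor_sum(x, i, j):
    """Sum of the four nearest neighbors of site (i, j) on a torus."""
    n_rows, n_cols = len(x), len(x[0])
    return (x[(i - 1) % n_rows][j] + x[(i + 1) % n_rows][j]
            + x[i][(j - 1) % n_cols] + x[i][(j + 1) % n_cols])

def prob_white(theta1, theta2, n_i):
    """p(x_i = 1 | n_i), obtained by inverting the logit in (13)."""
    return 1.0 / (1.0 + math.exp(-2.0 * (theta1 + theta2 * n_i)))
```

A Gibbs update of site (i, j) sets it to 1 with probability `prob_white(theta1, theta2, neighbor_sum(x, i, j))` and to −1 otherwise.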

A  Metropolis  algorithm  for  the  Ising  model  runs  over 
the  variables  in  either  fixed  or  random  order  attempting 
to  swap  the  state  of  the  variable  at  each  step  (from  1  to 
—  1  or  vice  versa)  according  to  the  odds  ratio  of  these  two 
states. A Gibbs sampler does the same thing but instead
samples from the conditionals. Metropolis makes more
transitions and hence is a bit better, but there is not
much  difference. 

Whichever is used, it is wise to follow each scan of
all the variables with a symmetry swap, attempting to
change x for −x, where −x denotes the state derived
from x by changing the sign of all the variables. The
odds ratio for this swap is r = exp(t_1(−x)α − t_1(x)α)
since t_2(x) = t_2(−x). When α is small and β is large so
the model has a bimodal distribution, these swaps jump
between modes. For other parameter values, the swaps
are not useful, but they are also not needed since the
distribution is unimodal and the Markov chain mixes
rapidly in any case. The swaps do no harm, though,
since they consume a small fraction of the running time.
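
A sketch of the two natural statistics and of the symmetry swap is given below, again with the field stored as a list of rows of +1 and −1 values. Since t_1(−x) = −t_1(x), the odds ratio reduces to exp(−2 α t_1(x)). The function names are hypothetical, and the pair count in `t2` assumes each lattice dimension is at least three so that no neighbor pair is counted twice.

```python
import math
import random

def t1(x):
    """Excess of white (+1) over black (-1) pixels."""
    return sum(sum(row) for row in x)

def t2(x):
    """Concordant minus discordant nearest neighbor pairs on a torus,
    counting each pair once (right and down neighbor of every site)."""
    n_rows, n_cols = len(x), len(x[0])
    return sum(x[i][j] * (x[i][(j + 1) % n_cols] + x[(i + 1) % n_rows][j])
               for i in range(n_rows) for j in range(n_cols))

def symmetry_swap(x, alpha, rng=random):
    """Metropolis-accept the move from x to -x."""
    log_r = -2.0 * alpha * t1(x)  # t2 is unchanged by the sign flip
    if log_r >= 0 or rng.random() < math.exp(log_r):
        return [[-v for v in row] for row in x]
    return x
```

When α = 0 the swap is always accepted, which is how the chain moves between the all-white and all-black modes.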

With symmetry swaps the Markov chain for the Ising
model runs fast no matter what the parameter values,
provided it is started in the right place: all pixels the
same color. If one chooses a random starting point, and
β is well above the critical point, it takes a very long
time to get to any likely configuration.

Symmetry swaps solve all difficulties of simulating
Ising models (and other lattice processes with only a
few colors). Hence Metropolis-coupling is not needed.
To avoid introducing another model, however, let us also
solve the Ising model difficulties using Metropolis coupling.
At values of β well below the critical value, a single
chain runs fast, the distribution is unimodal, and the
region of high probability is rapidly explored. For very
high β the chain runs arbitrarily slowly; the waiting time
for a transition between modes can be arbitrarily long.
If low and high β chains are coupled with a sequence
of intermediate β chains, swaps will occur frequently if
adjacent β's are close enough, and all of the chains will
mix rapidly. Thus Metropolis coupling can produce an
arbitrarily large speed up in some situations. This solution
to problems of slow mixing is completely general;
it does not even require knowledge of a good starting
point (as did symmetry swapping). All that is required
is that some of the coupled chains mix rapidly.

It  is  possible  to  get  an  infinite  speed  up  from  coupling 
chains.  If  one  couples  a  chain  that  is  not  ergodic  (so 
that  it  would  never  get  the  right  answer)  with  one  that 
is,  this  can  make  both  chains  ergodic.  Thus  coupling  can 
be  used  to  solve  difficult  problems  of  finding  a  Markov 
chain  that  is  ergodic  as  well  as  problems  of  slow  mixing. 

7  Monte  Carlo  Maximum  Likelihood 

Consider a family of probability densities {f_θ} with respect
to some measure μ, where the densities are known
only up to a normalizing constant

    f_\theta(x) = \frac{1}{z(\theta)} h_\theta(x)

where h_θ is a known function for each θ but nothing is
known about z except that

    z(\theta) = \int h_\theta(x) \, d\mu(x),

the integral being analytically intractable. The Ising
model serves as an example with h_θ(x) = e^{⟨t(x), θ⟩}. Other
examples include spatial lattice and point processes,
Markov graphs, and logistic regression with dependent
responses (see Geyer and Thompson, 1992).

The unknown normalizing constant z is no bar to
Markov chain Monte Carlo, which can provide a sample
X_1, X_2, ... from the distribution for any φ in the
parameter space. This can be used to estimate the log
likelihood ratio for an
[text and equations (14) and (15) missing in the source]
of the last term in (14). Let l_n(θ) denote (14) with the
last term replaced by (15). By the ergodic theorem we
have that l_n(θ) → l(θ) simultaneously for all θ in any
countable set, which if the parameter takes values in
Euclidean space may be chosen to be dense. This along
with the “usual” regularity conditions may be enough to
ensure that if θ̂_n is any maximizer of l_n and θ̂ the
maximizer of l, then θ̂_n → θ̂, i.e., the Monte Carlo MLE
converges to the true MLE as the size of the Monte Carlo
sample goes to infinity. For the Ising model no regularity
conditions are needed because both l and l_n are concave
functions. Second order theory, √n (θ̂_n − θ̂) converging
to some normal distribution, is also available, again
under the “usual” regularity conditions, when the
asymptotic variance of √n ∇l_n(θ̂) can be shown to be
finite, since this can then be estimated empirically using
the methods of Section 4.
Details will appear elsewhere.
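
Equations (14) and (15) are not legible in this copy, so the following sketch rests on an assumption: that the log likelihood ratio against a fixed φ is log h_θ(x)/h_φ(x) − log z(θ)/z(φ), and that the ratio of normalizing constants is estimated by the sample average of h_θ(X_i)/h_φ(X_i) over a sample from the distribution at φ. That average converges to z(θ)/z(φ) by the ergodic theorem. The function name and the `log_h(theta, x)` interface are hypothetical.

```python
import math

def mc_log_likelihood_ratio(log_h, theta, phi, x_obs, sample):
    """Monte Carlo approximation to l(theta) - l(phi).

    log_h(theta, x): log of the unnormalized density h_theta(x).
    sample: draws X_1, ..., X_n from the distribution with parameter phi.
    """
    log_w = [log_h(theta, xi) - log_h(phi, xi) for xi in sample]
    top = max(log_w)  # subtract the maximum before exponentiating
    log_mean = top + math.log(sum(math.exp(v - top) for v in log_w) / len(log_w))
    return (log_h(theta, x_obs) - log_h(phi, x_obs)) - log_mean
```

For an exponential family such as the Ising model, `log_h(theta, x)` is just the inner product of the natural statistics with θ, so the statistics of the sample can be computed once and reused for every θ.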

This  method  can  be  generalized  to  use  Monte  Carlo 
samples  from  distributions  other  than  those  in  the  para¬ 
metric  family,  in  particular  to  mixtures  of  distributions 
in the family. This improves performance when θ is far
from φ, and is the method used for the example in Figure
1. Details of the theory and the calculation of this
example  are  given  in  Geyer  (1991a). 

Given that maximum likelihood can be done, how well
does it compare with other methods? Is it worth the effort
of the elaborate Monte Carlo calculations? What is
analytically tractable about the Ising model (and other
Markov spatial processes) is the conditional distributions
p_θ(x_i = 1 | x_{-i}) defined by (13). The pseudolikelihood
is the product of these conditionals. This is not,
of course, a likelihood, since these conditionals do not
combine in the right way to make a probability. The
MPLE is found by maximizing the log pseudolikelihood

    \sum_i \log p_\theta(x_i \mid x_{-i})

(Besag, 1975). For the Ising model this is computationally
equivalent to doing a logistic regression of each pixel
on its neighbors. The estimate takes negligible time to
compute compared to Monte Carlo MLE.
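
The log pseudolikelihood of an observed field can be evaluated directly from (13). In the sketch below the field is a list of rows of +1 and −1 values on a torus and the function name is hypothetical. It uses the identity log p_θ(x_i | n_i) = −log(1 + exp(−x_i η_i)) with η_i = 2(θ_1 + θ_2 n_i), and assumes parameter values moderate enough that the exponential does not overflow.

```python
import math

def log_pseudolikelihood(x, theta1, theta2):
    """Sum over sites of log p(x_i | neighbors) for the Ising model."""
    n_rows, n_cols = len(x), len(x[0])
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            n_ij = (x[(i - 1) % n_rows][j] + x[(i + 1) % n_rows][j]
                    + x[i][(j - 1) % n_cols] + x[i][(j + 1) % n_cols])
            eta = 2.0 * (theta1 + theta2 * n_ij)  # logit of p(x_ij = 1 | n_ij)
            total -= math.log1p(math.exp(-x[i][j] * eta))
    return total
```

Maximizing this function over (θ_1, θ_2) with any general-purpose optimizer gives the MPLE.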


Figure 1: Comparison of MLE and MPLE. Top MLEs,
bottom MPLEs for a sample of 500 points from the Ising model
with α = 0 and β = 0.425.

Furthermore, it is a good estimate for small dependence,
when p_θ(x_i | x_{-i}) ≈ p_θ(x_i), when it well approximates
maximum likelihood. For high dependence,
MPLE can do much worse than MLE, as shown in Figure
1. The true parameter value is where the solid lines
cross. Both estimators cluster around the truth, but
MPLE has much wider scatter. Moreover, maximum
likelihood “senses” the critical point, shown by the dotted
line, in a way that MPLE does not. Of the 500 points
in the sample, only six are above the critical point, only
two appreciably so. The dotted line in the figure is like
a cliff of the likelihood surface. These samples from a
process below the critical point do not look at all like
they came from a process above the critical point.

Pseudolikelihood is oblivious to the critical point,
which is not surprising, since it only looks at local dependence
and the critical point phenomenon is a global
property. There are 134 of the MPLEs lying above
the critical point, some so high that true realizations
from such parameter values would be hard frozen, not
remotely resembling the observation from which the
MPLE was calculated.

8  Discussion 

Though consistency and asymptotic normality of MPLE
have been proved in a variety of situations, these results
do not guarantee good behavior at finite sample sizes. It
has never been claimed that MPLE would provide good
estimates for parameters of a frozen (or nearly frozen)
Markov random field, so the message that in some cases
MLE behaves well when MPLE does poorly is no surprise.
That MPLE can be inefficient had been noted
for Gaussian random fields on lattices (Besag, 1977),
where the efficiency goes to zero at the boundary of the
parameter space where stationarity is lost. Moderately
large efficiency is maintained, however, for fairly large
dependence, which gives the impression that MPLE is a
reasonable method of estimation for Gaussian fields so
long as the true parameter value is not near the boundary
of the parameter space.

Ising  models  and  other  non-Gaussian  random  fields 
can  have  critical  parameter  values  not  on  the  boundary 
of  the  parameter  space  at  which  the  qualitative  behav¬ 
ior  of  the  field  changes.  Near  such  values,  and  for  high 
dependence  in  general,  MPLE  can  give  bad  results.  One 
Ising model example is given here; a more complex example
is given in Geyer and Thompson (1992). This
does not say MPLE is bad in all problems; it seems that
comparisons  must  be  made  problem  by  problem. 
Abstract

This paper reports our experiments with parallel and
sequential implementations for combining belief functions,
with an application to a medical diagnostic system.
We use as a basis existing methods for combining two
belief functions: a direct combination based on Dempster’s
rule and an indirect combination through Mobius
transforms. We further explore various parallel algorithms
for combining more than two belief functions,
as different belief functions can be combined in any order
as long as they are independent of each other. Our
results indicate that for the general case, the parallel
implementation based on fast Mobius transforms proves
to be the most efficient. However, for practical applications
where most subsets of a frame of hypotheses have
zero probabilities, the parallel implementation based on
an improved direct combination rule remains the most
efficient.

1  Introduction 

This paper presents parallel and sequential algorithms
for combining belief functions. The Belief Function
approach for approximate reasoning, also called the
Dempster-Shafer theory [Shafer, 1976], can be seen as a
generalization of the Probability approach [Pearl, 1988],
since probabilities are assigned directly to subsets of a set
of mutually exclusive and exhaustive hypotheses rather
than to each of the hypotheses.

One important problem for the application of the DS-theory
is the efficiency of combining the belief functions
from different pieces of evidence. Barnett [1981] proposed a
polynomial algorithm which only applies to sets of single
hypotheses or singletons. Work by [Shafer and Logan,
1987] and [Shafer et al., 1987] deals with extended subsets
that form a hierarchical structure. More recently,
Kennes and Smets [1990] apply fast Mobius transforms
to reduce redundant computations and thus improve the
efficiency even for the general case.

In this paper, we are concerned with the efficient combination
of more than two belief functions. We use as
a basis existing methods for combining two belief functions:
a direct combination based on Dempster’s rule and
an indirect combination through Mobius transforms. We
further explore parallel algorithms for combining more
than two belief functions in order to improve the efficiency,
as different pieces of evidence can be combined
in any order as long as they are independent of each
other.

To further test our algorithms, we consider a medical
domain that involves the diagnosis of different types
of canine liver diseases (McLeish et al. [1989], [1990],
[1991]). This is a domain in which doctors have difficulty
predicting precise or single outcomes, as both the
number of possible outcomes (14) and the number of
available tests (40) are quite large. In terms of the DS-theory, this
would require a combination of 40 belief functions over a
frame of 14 different hypotheses¹. Although our parallel
algorithms can largely speed up the implementation, the
amount of time used is still quite long. Fortunately, for
practical applications, especially our domain, we found
that most of the subsets have zero probabilities; the number
of subsets that have non-zero probabilities, called
the focal elements, is just about 10 on average. Thus,
special versions of our algorithms can be designed to facilitate
the practical application. Our algorithms are all
implemented on a Sequent machine using the parallel C
language and the experimental results are reported later
in detail.

2  Review  of  the  DS-theory 

In DS-theory, probabilities are assigned directly to
subsets of a frame of hypotheses by a function called a
mass function (m). Two pieces of evidence can be combined
using Dempster's rule, where $m_1$ and $m_2$ are the mass
functions for the given pieces of evidence:

$$ m(B) = \frac{\sum \{ m_1(B_1)\, m_2(B_2) \mid B_1 \cap B_2 = B \}}{\sum \{ m_1(B_1)\, m_2(B_2) \mid B_1 \cap B_2 \neq \emptyset \}} $$

The rule, as stated in [Buchanan and Shortliffe, 1984],
provides a way of narrowing the hypothesis set with the
accumulation of evidence and naturally captures the
process of diagnostic reasoning in medicine and expert
reasoning in general.

¹ See [McLeish and Song, 1991] for the general framework
of our expert system for diagnosing canine liver diseases.

There are two ways of combining mass functions
proposed in the current literature ([Shafer, 1976] and
[Kennes and Smets, 1990]). One is the direct combination
based on Dempster's rule, for which it can be
shown that the following theorem holds:

Theorem 2.1 The direct implementation of Dempster's
rule needs $(2^n - 1)^2$ additions and $2^n(2^n - 1)$
multiplications, where n is the size of the frame.
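The direct combination can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Parallel C code: subsets are encoded as bitmasks (bit i set means hypothesis i is in the subset, so index 0 is the empty set), a mass function is a dense list of length $2^n$, and the name `combine_direct` is hypothetical.

```python
def combine_direct(m1, m2):
    """Dempster's rule on two dense mass vectors indexed by subset bitmask."""
    size = len(m1)
    raw = [0.0] * size
    # Accumulate every product m1(B1) * m2(B2) on the intersection B1 & B2.
    for b1 in range(size):
        for b2 in range(size):
            raw[b1 & b2] += m1[b1] * m2[b2]
    # Mass that fell on the empty set is the conflict; renormalize by 1 - conflict.
    k = 1.0 - raw[0]
    if k <= 0.0:
        raise ValueError("the two mass functions are in total conflict")
    combined = [value / k for value in raw]
    combined[0] = 0.0
    return combined


# Frame {a, b}: index 1 = {a}, index 2 = {b}, index 3 = {a, b}.
m1 = [0.0, 0.6, 0.0, 0.4]
m2 = [0.0, 0.0, 0.5, 0.5]
result = combine_direct(m1, m2)
```

The double loop visits all $2^n \times 2^n$ pairs of subsets, which is where the quadratic operation count in Theorem 2.1 comes from.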

The other way of combining mass functions is the
indirect combination through Mobius transforms. Based
on a mass function, a commonality function (Q) is
further defined in [Shafer, 1976]:

$$ Q(A) = \sum \{ m(B) \mid B \supseteq A \} $$

With commonality functions, the combination of different
pieces of evidence is reduced to the multiplication of the
commonality functions,

$$ Q(A) = K\, Q_1(A) \cdots Q_n(A) $$

where $K$ is a constant that does not depend on A.

A Mobius transform is a function defined over a
partially ordered set. For example, the computations
from m to Q and vice versa are all Mobius transforms.
The idea of a fast Mobius transform is to decompose
the whole transform into a series of simple transforms
[Kennes and Smets, 1990]. In each step, as illustrated
in figure 1, we only consider one hypothesis and its
related transform. For example, the first step will achieve
the transform $\{(X, Y) \mid X \subseteq \Theta$ and $(Y = X$ or
$Y = X \cup \{c\})\}$, where X and Y are two subsets of $\Theta$.
Then, by recursively doing this for all the hypotheses, we
will be able to transform from one function to another
function.
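The step-by-step decomposition can be sketched as follows. This is a minimal illustration, assuming the same bitmask encoding of subsets as a dense list of length $2^n$; the function name `mtoq` follows the pseudocode used later in the paper, but the body is a standard superset-sum transform rather than the authors' code. The order in which hypotheses are processed does not change the result.

```python
def mtoq(m, n):
    """Fast Mobius transform from a mass function m to its commonality function Q.

    Q(A) is the sum of m(B) over all supersets B of A. One hypothesis (bit)
    is handled per pass, so the total cost is n * 2**(n-1) additions.
    """
    q = list(m)
    for bit in range(n):
        step = 1 << bit
        for subset in range(len(q)):
            if not subset & step:
                # Pair (X, X + {hypothesis}): push the larger set's value down to X.
                q[subset] += q[subset | step]
    return q


# Frame {a, b}: index 1 = {a}, index 2 = {b}, index 3 = {a, b}.
q = mtoq([0.0, 0.6, 0.0, 0.4], 2)
```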


[Figure 1: Diagram for the Transform: m → Q, drawn over the
subsets { }, {a}, {b}, {a,b}, {c}, {a,c}, {b,c}, {a,b,c}.]

To combine mass functions, we follow the path from
{m_i} to {Q_i} to Q to m, as shown in figure 2. Although
the transform from Q to m is not provided
in [Shafer, 1976], it can be proved, following a similar
approach, that the following lemma holds.

[Figure 2: Combination through Mobius Transform. Node and
arrow labels: (m1, m2), m1 ⊕ m2, m, mtoq, (Q1, Q2),
Q1 * Q2, qtom, Q.]

Lemma 2.1 Suppose m and Q are two functions defined
over a frame $\Theta$, then we have:

$$ Q(A) = \sum_{B \supseteq A} m(B) \iff m(A) = \sum_{B \supseteq A} (-1)^{|B - A|}\, Q(B) $$

Based on the above lemma, we can now construct a fast
Mobius transform from Q to m. It is the same as the
transform from m to Q except that all the links have
weighting factor (-1) (see [Kennes and Smets, 1990] for
detailed discussions).

Theorem 2.2 The indirect implementation of Dempster's
rule through Mobius transforms needs $3n2^{n-1}$
additions and $2^{n+1}$ multiplications.

3 Algorithms for Combining Belief Functions

In this section, we consider how to combine r pieces of
evidence efficiently, with r > 2. In particular, we present
three pairs of algorithms for combining r mass functions:
sequential, parallel, and practical methods.


3.1  Sequential  Combination  Methods 

Based  on  the  two  methods  introduced  earlier,  we  can 
provide  two  sequential  algorithms  for  combining  more 
than  two  belief  functions.  A  sequential  algorithm  based 
on  Dempster’s  rule  can  be  given  as  follows: 

algorithm 3.1 sequential & direct implementation
input m[1 : r][0 : 2^n - 1], r bodies of mass functions,
      and n, the cardinality of the frame
output m[1][0 : 2^n - 1], the combined mass function
begin
    for i = 2 step 1 until r do
        comb-two(m[1], m[i])
    endfor
end

Here, we use an n-digit binary number to represent a
subset of a frame of size n: the ith digit is 1 if the
ith element of the frame is in the subset. Also,
"comb-two" is a procedure for combining two mass
functions.

Corollary 3.1 Algorithm 3.1 needs $(r - 1)(2^n - 1)^2$
additions and $(r - 1)2^n(2^n - 1)$ multiplications.

Another  way  of  implementing  the  Dempster’s  rule 
is  to  compute  the  combined  mass  function  indirectly 
through  Mobius  transforms.  A  sequential  algorithm  for 
this  method  can  be  given  as  follows: 

algorithm 3.2 sequential & indirect implementation
begin
    for i = 1 step 1 until r do
        mtoq(m[i])
    endfor

    for i = 0 step 1 until 2^n - 1 do
        for j = 2 step 1 until r do
            m[1][i] ← m[1][i] * m[j][i]
        endfor
    endfor

    qtom(m[1])
end

Corollary 3.2 Algorithm 3.2 needs $n(r + 1)2^{n-1}$
additions and $r2^n$ multiplications.
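The indirect path can be sketched end to end in Python. This is an illustration under assumptions, not the authors' implementation: subsets are bitmask indices into dense lists, `mtoq` and `qtom` are written here as plain superset-sum transforms and their inverse, and the final normalization step (dividing by one minus the mass left on the empty set) is made explicit even though the pseudocode leaves it implicit.

```python
def mtoq(m, n):
    """Mass function to commonality function (sum over supersets)."""
    q = list(m)
    for bit in range(n):
        step = 1 << bit
        for s in range(len(q)):
            if not s & step:
                q[s] += q[s | step]
    return q


def qtom(q, n):
    """Inverse transform: same links as mtoq, each with weighting factor -1."""
    m = list(q)
    for bit in range(n):
        step = 1 << bit
        for s in range(len(m)):
            if not s & step:
                m[s] -= m[s | step]
    return m


def combine_indirect(masses, n):
    """Combine r mass functions by multiplying their commonality functions."""
    product = mtoq(masses[0], n)
    for other in masses[1:]:
        q_other = mtoq(other, n)
        product = [x * y for x, y in zip(product, q_other)]
    raw = qtom(product, n)
    # raw[0] is the conflicting mass on the empty set; renormalize the rest.
    k = 1.0 - raw[0]
    if k <= 0.0:
        raise ValueError("the mass functions are in total conflict")
    combined = [value / k for value in raw]
    combined[0] = 0.0
    return combined


combined = combine_indirect([[0.0, 0.6, 0.0, 0.4], [0.0, 0.0, 0.5, 0.5]], 2)
```

For this two-hypothesis example the result agrees with a direct application of Dempster's rule: masses 3/7, 2/7 and 2/7 on {a}, {b} and {a, b}.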

3.2  Parallel  Combination  Methods 

Since  in  DS-theory,  different  pieces  of  evidence  can  be 
combined  in  any  order  as  long  as  they  are  independent 
of  each  other,  we  can  further  explore  parallel  algorithms 
for  the  combination  of  more  than  two  belief  functions. 

algorithm 3.3 parallel & direct implementation
begin
    while r > 1 do
        r' ← ⌊r/2⌋
        for i = 1 step 1 until r' do in parallel
            comb-two(m[i], m[r' + i])
        endfor
        if odd(r) then
            m[r' + 1] ← m[r]; r ← r' + 1
        else r ← r'
    endwhile
end

Corollary 3.3 Algorithm 3.3 needs $\lceil \log r \rceil (2^n - 1)^2$
additions and $\lceil \log r \rceil 2^n(2^n - 1)$ multiplications,
where $\lceil \log r \rceil$ stands for the smallest integer that
is greater than or equal to log r.
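The pairing schedule of Algorithm 3.3 can be sketched with Python's standard thread pool. This is a sketch of the schedule only, under assumptions: `comb_two` is a dense bitmask implementation of Dempster's rule written for this example, and `combine_parallel` is a hypothetical name. In CPython, threads will not give the speedup of a true multiprocessor such as the one used in the paper; the point is the halving structure that yields the logarithmic number of rounds.

```python
from concurrent.futures import ThreadPoolExecutor


def comb_two(m1, m2):
    """Dempster's rule on two dense mass vectors indexed by subset bitmask."""
    raw = [0.0] * len(m1)
    for b1, v1 in enumerate(m1):
        for b2, v2 in enumerate(m2):
            raw[b1 & b2] += v1 * v2
    k = 1.0 - raw[0]
    out = [value / k for value in raw]
    out[0] = 0.0
    return out


def combine_parallel(masses):
    """Pairwise tree reduction: each round halves the number of mass functions."""
    work = list(masses)
    with ThreadPoolExecutor() as pool:
        while len(work) > 1:
            half = len(work) // 2
            # Combine work[i] with work[half + i] for all i, concurrently.
            paired = list(pool.map(comb_two, work[:half], work[half:2 * half]))
            if len(work) % 2:
                paired.append(work[-1])  # odd one out is carried to the next round
            work = paired
    return work[0]


masses = [
    [0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.2, 0.3, 0.5],
    [0.0, 0.1, 0.1, 0.8],
]
parallel_result = combine_parallel(masses)
```

Because Dempster's rule is commutative and associative for independent evidence, the tree-shaped order gives the same result as combining the functions one after another.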

algorithm 3.4 parallel & indirect implementation
begin
    for i = 1 step 1 until r do in parallel
        mtoq(m[i])
    endfor

    for i = 0 step 1 until 2^n - 1 do in parallel
        for j = 2 step 1 until r do
            m[1][i] ← m[1][i] * m[j][i]
        endfor
    endfor

    qtom(m[1])
end

Corollary 3.4 Algorithm 3.4 needs $n2^n$ additions and
$2^n + r$ multiplications.

3.3 Practical Combination Methods

To further test our algorithms, we choose a medical
domain that involves the diagnosis of canine liver diseases.
We found that for such a domain, most of the mass
functions only have a small number of non-zero subsets, or
focal elements. Although the above algorithms work for
general cases, for practical reasons, we must revise them
to take advantage of the almost null distribution of mass
functions.

In the following we first provide a revised procedure
for direct combination based on Dempster's rule.

function comb-two'(m1, m2, L1, L2)
begin
    for i = 1 step 1 until L1 do
        for j = 1 step 1 until L2 do
            s ← s1[i] & s2[j]
            m[s] ← m[s] + m1[i] * m2[j]
        endfor
    endfor
    K ← 1 - m[0]
    L ← 0
    for i = 1 step 1 until 2^n - 1 do
        if m[i] > 0 then
            L ← L + 1
            s1[L] ← i; m1[L] ← m[i]/K
        endif
    endfor
    return L
end

Here, & is the bitwise operator for the logical operation
"AND", corresponding to the intersection operation
between two subsets.
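A compact way to express the same idea in Python is to store only the focal elements in a dictionary keyed by subset bitmask. This is a sketch under assumptions rather than a transcription of comb-two': the dictionary representation and the name `comb_two_sparse` are choices made for this example, and the final scan over all $2^n$ subsets is replaced by iterating over the entries that were actually produced.

```python
def comb_two_sparse(m1, m2):
    """Dempster's rule over focal elements only.

    m1 and m2 map subset bitmasks to non-zero masses. The cost is
    proportional to len(m1) * len(m2) rather than to 2**n * 2**n.
    """
    raw = {}
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            s = s1 & s2  # bitwise AND is set intersection
            raw[s] = raw.get(s, 0.0) + v1 * v2
    conflict = raw.pop(0, 0.0)  # mass on the empty set
    k = 1.0 - conflict
    if k <= 0.0:
        raise ValueError("the two mass functions are in total conflict")
    return {s: v / k for s, v in raw.items() if v > 0.0}


# Frame {a, b}: 1 = {a}, 2 = {b}, 3 = {a, b}.
sparse_result = comb_two_sparse({1: 0.6, 3: 0.4}, {2: 0.5, 3: 0.5})
```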

Then  a  parallel  algorithm  for  combining  more  than 
two  mass  functions  can  be  designed  as  follows: 

algorithm 3.5 practical par. & dir. implementation
begin
    while r > 1 do
        r' ← ⌊r/2⌋
        for i = 1 step 1 until r' do in parallel
            L[i] ← comb-two'(m[i], m[r' + i], L[i], L[r' + i])
        endfor
        if odd(r) then
            m[r' + 1] ← m[r]; r ← r' + 1
        else r ← r'
    endwhile
end

To see how speed can be gained with the above algorithm,
let us consider our domain of canine liver diseases.
For a frame of size 14, $2^{14}$ gives us 16,384. Thus,
the direct combination of two mass functions would
require $(2^{14} - 1)^2$ additions and $2^{14}(2^{14} - 1)$
multiplications. However, the above improved direct
combination would only need about 100 additions and 110
multiplications, as the average number of focal elements is
10 for any mass function in our domain (see [McLeish and
Song, 1991] for different methods of extracting mass
functions from medical data collected over time).
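Written out, the two counts compare as follows. The dense figures follow directly from Theorem 2.1 with n = 14; the split of the sparse figures into 10 x 10 products plus 10 normalizing divisions is an interpretation of the "about 100" and "110" quoted above, not something the text states.

```latex
\begin{align*}
\text{dense additions:}        \quad & (2^{14} - 1)^2 = 16383^2 = 268{,}402{,}689 \\
\text{dense multiplications:}  \quad & 2^{14}(2^{14} - 1) = 16384 \times 16383 = 268{,}419{,}072 \\
\text{sparse additions:}       \quad & 10 \times 10 = 100 \\
\text{sparse multiplications:} \quad & 10 \times 10 + 10 = 110
\end{align*}
```

The saving is thus roughly six orders of magnitude per pairwise combination.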

Similarly, we can add a testing statement in a Mobius
transform  and  only  perform  an  addition  when  the  new 
element  is  non-zero.  Since  the  cost  of  a  testing  statement 
is  usually  less  than  an  arithmetic  operation,  we  would 
expect  some  saving  of  time  when  most  of  the  subsets 
have  zero  probabilities.  The  modified  algorithm  based 
on  the  Mobius  transforms  will  be  called  algorithm  3.6  in 
our  experiments. 

4  Experimental  Results 

Our algorithms are all implemented on a Sequent Symmetry
machine using the Parallel C language [Osterhaug,
1989]. A Sequent machine has an architecture of truly
multiple processors and a shared memory, all connected
through a system bus. This provides a way of increasing
the accessibility of data and minimizing the communication
cost. As a result, we can actually run our algorithms
on this machine and observe the improvement in speed
for a problem of reasonable size.

In our experiments, we run our algorithms on a machine
with ten processors. Our results can be improved further
when more processors are available, say 16 or 32, which
are becoming increasingly common for Sequent machines.
Although our system is not large, it already shows the
potential of using parallel algorithms for efficiently
combining belief functions.


# Mass    Alg3.2    Alg3.4    Alg3.5    Alg3.6
02         13.26      9.18      0.43      7.56
03         17.71      9.25      0.95      7.59
04         22.19      9.31      0.95      7.72
05         26.74      9.37      1.51      7.76
08         40.22     13.88      1.49     11.30
10         49.19     14.00      2.04     11.46
15         71.68     18.65      2.35     15.15
16         76.21     23.01      2.34     18.65
20         94.21     23.30      2.88     18.86
25        117.40     27.93      3.52     22.57
30        140.49     32.60      3.54     26.29
32        149.74     37.03      3.86     29.70
35        164.69     37.20      4.39     30.01
40        188.18     41.84      4.39     33.67

Table 1: Results of Sequential and Parallel Experiments

As our results illustrate, for the general case, the
parallel implementation based on the fast Mobius transforms
(algorithm 3.4) is the most efficient. However, for
many real applications where most of the subsets have
zero masses, the parallel implementation based on the
improved direct combination (algorithm 3.5) is still the
most efficient².

Further work is being carried out to minimize redundant
computations in a Mobius transform and explore
parallelism in Dempster's rule. Methods working with
continuous data are also being investigated with an
application to our domain of liver disease diagnosis.

² [...] 1.5 hours on our Sequent machine.

Abstract

Dempster  [1]  has  characterized  the  dynamic  linear 
model  (DLM)  as  a  probabilistic  belief  network,  showing 
that  recent  algorithms  for  propagation  of  information 
in  such  networks  generalize  Kalman  filtering,  predic¬ 
tion  and  smoothing  algorithms  for  the  DLM.  Recently 
the  Bayesian  network  technology  has  been  extended  to 
model  mixed  discrete  and  continuous  random  variables 
using  conditional  Gaussian  (CG)  distributions  [5]  with 
analogous  propagation  schemes  [6].  This  paper  applies 
the  theory  of  CG  probability  networks  to  character¬ 
ize  the  multiprocess  dynamic  linear  model  (MPDLM) 
and  its  requisite  computations  in  a  unified  way.  The 
complexity  of  exact  computations  is  determined  and 
approximate  methods  are  proposed. 


1  Introduction 

In this paper we apply the theory of conditional
Gaussian networks to a class of dynamic linear models
that incorporate uncertainty as to the underlying
generating model. This class of models has the property
that its dependency structure can be modelled graphically.
The resulting graph falls under an umbrella of
names: a causal probability network, a Bayes belief net,
a causal network, a Bayes network, or an influence
diagram. Interest centers on the posterior distributions of
various sets of random variables. The motivation for
using a graphical representation is computational
convenience; the calculations are reduced to a series of
efficient local computations. To implement the
computations, the graph is transformed into another structure
called a junction tree [4]. It is in the junction tree that
the calculations are performed.


2  Multiprocess  Models 


2.1  The  Dynamic  Linear  Model 

The dynamic linear model [3] is a discrete time
linear model that captures a variety of familiar models:
regression models, time-dependent covariate models,
exponential smoothing models and linear time-series
models. The series described by the DLM is a 2-stage
hierarchical model with stage 1 of the hierarchy defined
by the observation equation and the second stage
described by the system equation. The system equation
describes how the underlying process that drives the
observed series evolves with time t.

$$ Y_t = X_t \beta_t + \epsilon_t \qquad \text{(observation equation)} $$

$$ \beta_t = G_t \beta_{t-1} + u_t \qquad \text{(system equation)} $$

where $\beta_t$ is a p x 1 state vector, $G_t$ is a p x p
known transition matrix, $u_t$ is a p x 1 vector of
system errors, $Y_t$ is an r x 1 observation vector, $X_t$ is
an r x p known regressor matrix, and $\epsilon_t$ is an r x 1
vector of observation errors. It is assumed that
$u_t \sim$ independent $N_p(0, V_t)$ and $\epsilon_t \sim$
independent $N_r(0, \Sigma_t)$, where $V_t$ and $\Sigma_t$ are
known for all t > 0. We also assume for simplicity that
$u_t$ and $\epsilon_t$ are mutually independent.

Suppose we have prior information available
about $\beta_0$, say $\beta_0 \sim N_p(\mu_0, S_0)$. Interest
centers on inferences for $\beta_t$, t = 1, 2, .... If we
denote the present time by T, then for t < T, t = T and
t > T the problem becomes one of smoothing, filtering and
forecasting respectively. Recursive equations for
filtered, smoothed and forecasted estimates are available
[7, Pages 216-224], [8].
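For orientation, one filtering step of such a recursion can be sketched for the scalar case p = r = 1. This is the textbook Kalman update written against the observation and system equations above, not code from the cited references; the function name and argument names are hypothetical, with `v` standing for the system variance and `sigma` for the observation variance.

```python
def dlm_filter_step(m_prev, c_prev, y, x, g, v, sigma):
    """One scalar filtering step: posterior N(m_prev, c_prev) at t-1 -> posterior at t."""
    a = g * m_prev              # prior mean of the state at time t
    r = g * c_prev * g + v      # prior variance of the state at time t
    f = x * a                   # one-step forecast of the observation
    q = x * r * x + sigma       # variance of that forecast
    gain = r * x / q            # weight given to the forecast error
    m_new = a + gain * (y - f)  # filtered mean
    c_new = r - gain * q * gain # filtered variance
    return m_new, c_new


# Random-walk state observed directly, unit variances, observation y = 2.
m1, c1 = dlm_filter_step(m_prev=0.0, c_prev=1.0, y=2.0, x=1.0, g=1.0, v=1.0, sigma=1.0)
```

Calling the function repeatedly, feeding each posterior back in as the next prior, gives the filtered estimates for t = 1, 2, ..., T.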
2.2 An Extension: MPDLMs

The class of multiprocess models [3] reflects
uncertainty about the model by formally allowing the
model generating the process, henceforth the generating
model, at any given time to be a random choice
from a discrete number of alternative DLMs.

Denote the model at time t with generating
model j by [...] for j = 1, ..., N, where N represents
the total number of alternative model choices.
Assume that the probability that model j obtains at
time t is [...]. It can be shown that by appropriately
characterizing the system and observation variances, a
set of DLMs can be constructed that reflect level and
trend changes in the series as well as accommodate
observation outliers [3].

The estimation method proceeds in the same
manner as that for the DLM. However, passing from
time t to t+1, $N^2$ posteriors are obtained. This number
increases with time, indicating the need for an
approximation. One technique [3] approximates the mixture
of Gaussian posteriors by a Gaussian with a mixture
mean and mixture variance. The mixture mean is
calculated by weighting the posterior mean for each
generating model by the posterior probability that model
obtained at time t and summing over all possible
generating models. A similar characterization of the
mixture variance holds.
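The collapse of a Gaussian mixture to a single Gaussian can be sketched for the scalar case as below. The mean follows the description above. The variance formula is the standard moment-matching one (within-component variance plus spread of the component means), which is an assumption about what "a similar characterization" refers to; the function name is hypothetical.

```python
def collapse_mixture(weights, means, variances):
    """Approximate a scalar Gaussian mixture by one Gaussian with matching moments."""
    total = sum(weights)
    probs = [w / total for w in weights]  # normalize posterior model probabilities
    mean = sum(p * m for p, m in zip(probs, means))
    # Mixture variance: average component variance plus spread of component means.
    variance = sum(p * (v + (m - mean) ** 2) for p, m, v in zip(probs, means, variances))
    return mean, variance


mix_mean, mix_var = collapse_mixture([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
```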


3  Junction  Trees 

A junction tree [4] J for a graph is a tree whose nodes
are the cliques of the graph, together with separator sets
$S_i$ associated with the edges, which are the intersections
of each clique $C_i$ with its parent. The defining property
of a junction tree is that if $C_i$ and $C_j$ have elements
in common, all the separators on edges connecting $C_i$
and $C_j$ in J contain those common elements.
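The defining property can be checked mechanically. The sketch below is an illustration only: cliques are Python sets, the tree is given as a list of index pairs, and the function names are hypothetical. It assumes the edges really do form a connected tree and does not verify that.

```python
from collections import deque


def tree_path(adjacency, source, target):
    """Return the unique path between two nodes of a tree, found by breadth-first search."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour in adjacency[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1]


def is_junction_tree(cliques, edges):
    """Check that every separator between two cliques contains their common elements."""
    adjacency = {i: [] for i in range(len(cliques))}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            common = cliques[i] & cliques[j]
            path = tree_path(adjacency, i, j)
            for a, b in zip(path, path[1:]):
                separator = cliques[a] & cliques[b]
                if not common <= separator:
                    return False
    return True


good = is_junction_tree([{"a", "b"}, {"b", "c"}, {"c", "d"}], [(0, 1), (1, 2)])
bad = is_junction_tree([{"a", "b"}, {"c", "d"}, {"b", "c"}], [(0, 1), (1, 2)])
```

In the second example the cliques {a, b} and {b, c} share b, but the path between them passes through {c, d}, so the property fails.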

Propagation algorithms [4] for computing clique
marginals in the junction tree involve only local
operations between neighboring cliques. The clique size
determines the complexity of the operations. The
algorithms lend themselves to object-oriented implementation
and parallel processing. The pattern common to
these algorithms is: to propagate information from one
clique to another, a marginal for the separator on the
link connecting the two cliques is taken in the source
clique, and then that marginal multiplies a conditional
distribution calculated in the destination clique.


4  Graphical  Representation 

The causal graph D for the MPDLM is given in Figure
1. The generating model at time t is indicated by the
generating variable $I_t$. Dropping directions and joining
parents yields an undirected graph G with cliques
of the form $(I_t, \beta_{t-1}, \beta_t)$ and $(\beta_t, Y_t)$,
t = 1, 2, ..., T + K, where T is the current time and
prediction is K steps ahead. The potential for the first
clique is given by the system equation multiplied by the
prior for $I_t$, and the potential for the second clique is
given by the observation equation. Additionally, the prior
for $\beta_0$ is a factor in the potential on
$(I_1, \beta_0, \beta_1)$. The joining of parents has ensured
that these potentials are defined on the cliques of G. Thus
G can form the basis for a junction tree.

4.1  CG-junction  trees  and  the 
MPDLM 

Lauritzen and Wermuth [5] proposed modeling mixed
discrete and continuous random variables using CG
potentials. Lauritzen [6] gave approximate algorithms
based on CG potentials which maintain a CG
representation. [6] assumes that the joint distribution can
be expressed as a product of CG-potentials on the cliques
of G, and that there exists a junction tree for which each
separator $S_k$ satisfies the following additional property:

$$ S_k \subseteq \Delta \quad \text{or} \quad C_k \setminus S_k \subseteq \Gamma \qquad (1) $$

where $\Delta$ is the set of discrete variables and $\Gamma$ is
the set of continuous variables. We refer to a junction tree
with the above property as a CG-junction tree. The
algorithm [6] is approximate in that it employs an
operation called weak marginalization when propagating
information away from the root of the CG-junction tree.
The weak marginal approximates the true marginal by
a CG distribution with compatible moment properties.

A junction tree satisfying (1) exists if and only
if it represents the clique structure of a graph G' which
1) does not contain any path between two non-adjacent
discrete vertices passing through only continuous
vertices and 2) is triangulated. The first condition can be
satisfied for the MPDLM only if all discrete vertices
are adjacent. It follows that there will be a clique of
G' containing all T + K discrete variables, and hence
a multivariate marginal distribution of very high
dimension. After connecting each $I_t$ to all other
generating variables we must triangulate the resulting graph
by filling in edges to break cycles of length four or
greater. We used the method of [10]. The cliques of
the triangulated graph are next organized in a junction
graph, where cliques with nonempty intersection
are adjacent. Finally, we determine a spanning tree
$J_{CG}$ satisfying (1). The root clique is not arbitrary. In
fact, the root clique must have a CG distribution. Figure
2a gives a CG-junction tree $J_{CG}$ for the MPDLM. In
the figure, the rectangles are separators and the round
nodes are cliques. We have chosen a fill-in which yields
$J_{CG}$ rooted at time T. This has advantages for
implementation and also allows the distribution of $\beta_t$ to
be, in theory, exactly known at time t. The updating of
clique distributions to the right of the root is prediction,
smoothing is in the left subtree, and conditioning the
distribution of the root on $Y_T$ is filtering. As the
current time is incremented to T + 1, the tree grows by
adding leaves for T + K + 1 and identifying the root
with T + 1. In general, the root is the only clique whose
distribution is known. Only the moment characteristics
of other cliques are known.

Although this method allows recursive, local
computation, it does not solve the computational problem.
The root clique must store the mean and variance
for $\beta_T$, and a probability, for each cell in the high
dimensional table formed by all combinations of the
T + K generating variables. For N models, the
complexity is of order $N^{T+K}$. There is no hope of avoiding
this with Lauritzen's method, since no matter how we
contract the CG-junction tree, it must include a clique
containing all of the state variables.

4.2  An  approximate  topology 

Lauritzen has suggested reducing computations by
carrying the idea of weak marginalization further,
introducing weak marginalizations when propagating
toward the root. This is equivalent to implementing
Lauritzen's method in a modified CG-junction tree,
illustrated in Figure 2b. Essentially, when we propagate
evidence, we 'forget' all but R generating variables.

The junction tree provides a unified computational
framework. Filtering, prediction, and smoothing
are seen to be the same operation, evidence propagation,
in different parts of the tree. The sequence of data
collection is arbitrary. At any time a missing
observation can be 'found' and its influence propagated
throughout the tree. We thus generalize all operations
in the MPDLM.

The CG-junction tree for R = 2 duplicates the
filtering calculations of [9]. The power of the network
representation is that for any R it also implements
prediction, smoothing, and the handling of non-sequential
(e.g. missing or delayed) data collection. For filtering,
[9] report that the approximation for R = 2 is
adequate, but [2] express dissatisfaction with the
approximation. Work needs to be done to investigate the
adequacy of the method for different choices of R, for
filtering, smoothing, and prediction.

Figure 1. The causal graph for the MPDLM. Continuous variables are circles,
discrete variables are dots.

Figure 2. CG-Junction Trees for the MPDLM. a) The full tree is given by
a = L = 1, n = U = T + K. b) An approximate tree with range R is given by
a = T - R + 1, n = T + R - 1, L = upper bound - R + 1, U = lower bound + R - 1;
e.g. L ≤ t ≤ T gives L = T - R + 1.

1 Introduction

Conventional statistics packages such as S [3] or SAS [10]
have a limited choice of data structures for representing
statistical datasets. In particular, for an operation to be invoked
on a dataset or a subset of the data, the data first has to be
converted to a specific format, typically 1-d or 2-d arrays.
This leads to confusion for the analyst, who ends up having
to juggle many versions of the same data. In addition, the
analyst has to remember the connections between the different
versions, and the connections between analysis results and
the original data.

Multiple data versions are especially problematic for
interactive statistical graphics. Ideally, a plot acts as a graphical
interface to the underlying data, allowing all sorts of queries
such as requests for information on individuals and variables
in the dataset. The plot could then be modified by choosing
a new variable to replace one currently appearing in the plot.
None of this is possible if a scatterplot (for example) was
constructed using two 1-d arrays extracted from the data.

Interactive techniques for linking plots such as painting
(brushing) [2,7,8] require that the system be able to
determine which point (if any) in one plot 'corresponds' to a point
in another plot. Corresponding points typically represent the
same dataset individual, so multiple data versions are a
nuisance: either the analyst or the system has to remember the
connections.

In this note I consider the domain of interactive and
dynamic graphics. I describe how such a graphics system need
not enforce a particular choice of data representation. By
identifying the components of the plot-data interface, I
construct an abstraction barrier between plot and data.
Implementation of the interface relies on generic functions (see,
for example, Steele [11] or Keene [6]), which may then be
specialized for an arbitrary data representation. The need for
multiple data versions is reduced by incorporating a general
data transformation capability within the plot system.
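The role of generic functions as an abstraction barrier can be illustrated with a small sketch. The paper's system is written in Common Lisp and CLOS; the Python version below uses `functools.singledispatch` as a rough analogue, and the function name `case_names` and both data representations are hypothetical, invented for this example rather than taken from the Zed system.

```python
from functools import singledispatch


@singledispatch
def case_names(dataset):
    """Generic function: the plot asks for case names without knowing the data layout."""
    raise NotImplementedError(f"no case_names method for {type(dataset).__name__}")


@case_names.register(dict)
def _(dataset):
    # Representation 1: mapping from case name to a record of variable values.
    return list(dataset.keys())


@case_names.register(list)
def _(dataset):
    # Representation 2: list of rows whose first entry is the case name.
    return [row[0] for row in dataset]


as_mapping = {"tree-1": {"weight": 101.0}, "tree-2": {"weight": 97.5}}
as_rows = [("tree-1", 101.0), ("tree-2", 97.5)]
names_from_mapping = case_names(as_mapping)
names_from_rows = case_names(as_rows)
```

Plot code written against `case_names` works unchanged for either representation; supporting a new one means registering one more method.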

The statistical graphics system referred to here is part of
the forthcoming Zed system [9], and is described in Hurley
and Oldford [5]. Implementation is in Common Lisp and
CLOS [11, 6], so examples given here use a small amount
of Common Lisp syntax.


2  Plot-data  connection 

Using examples, the benefits of a general plot-data
interface are discussed. A software model for statistical graphics
based on a tight coupling of plot and data is outlined.

The following data taken from Andrews and Herzberg [1]
will be used as an example. There are 42 apple trees in a
designed experiment with 4 treatments and 4 blocks, with 8
qualitative variables measured on the fruit from each tree.
Each treatment x block combination initially had 4 trees, but
some trees bore no fruit. Figure 1 shows a scatterplot matrix
of 4 variables; the lower left plot displays treatment number
versus block number for each of the 16 groups, and the plot
on the lower right shows mean weight for each of the 16
groups plotted against block number.

2.1  Conventional  plot-data  interface 

At this time, many statistical graphics systems are primarily
drawing programs, yielding static plots and supporting little
or no interaction (for example, commonly available versions
of S [3]). For purposes of illustration, I describe a plot-data
interface for such a system.

The convention in statistics packages is to represent data
by multi-way arrays. For example, the apple data could be
a 2-d array with each row representing a tree. Typically
plotting functions require as arguments one or more 1-d
arrays (depending on the dimensionality of the plot), so the
plot-data interface consists of selecting slices of the data for
plotting. Suppose two columns are selected for a scatterplot;
then each pair of column entries becomes the subject of a
point in the plot.

To construct either of the lower plots in figure 1 requires
constructing a new 1-d array whose entries are averages of
weight for each treatment x block combination. Other
possible plots would use medians instead of averages, for
example, or weight rather than calcium, but we must first compute
new 1-d arrays for these quantities.
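The kind of derived array meant here can be sketched as follows. The records, field names and values are made up for illustration and are not the Andrews and Herzberg data; `group_means` is a hypothetical helper.

```python
from collections import defaultdict


def group_means(rows, key_fields, value_field):
    """Average value_field within each combination of key_fields."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        totals[key][0] += row[value_field]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Three illustrative trees; two share a treatment x block combination.
trees = [
    {"treatment": 1, "block": 1, "weight": 100.0},
    {"treatment": 1, "block": 1, "weight": 120.0},
    {"treatment": 2, "block": 1, "weight": 90.0},
]
mean_weight = group_means(trees, ("treatment", "block"), "weight")
```

Every such derived array is a further version of the data, and keeping track of which trees contributed to which group mean is left to the analyst.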

The advantage of such a plot-data interface is that it is
familiar and relatively easy to work with. The disadvantage
is that we must first of all convert the data to arrays, and then
construct new arrays for derived variables. As the analysis
becomes more involved it is easy to lose track of all the
derived data and their inter-relationships.

Figure  1 

2.2  Interactive  graphics 

The plots displayed in figure 1 are both interactive and
dynamic. An interactive plot is one with which the user can
communicate, whilst dynamic plots are plots that can change
instantaneously, typically but not necessarily in response to
a user action. Of course, interactive, dynamic plots require a
richer plot-data interface.

We use a point and click style interface to retrieve
information on the underlying data: for the scatterplot matrix
selecting the 'identify' operation on a point gives the name
of the tree it represents, whilst in the lower plot 'identify'
returns the names of all trees in the group represented by a
point. Similarly the 'inspect data' operation returns all
information available on the tree (or trees) represented by the
selected point.

Changing to another variable provides a simple example
of plot modification. The new variable could be present in
the dataset, or a derived variable computed using a
transformation of existing variables. Buja et al. [4] describe a
series of data transformations called a viewing pipeline for
obtaining plot coordinates from the data, where any pipeline
element could be modified resulting in an updated plot.
The plot-data interface should accommodate such a viewing
pipeline, so that the plot system rather than the analyst looks
after computation of derived variables and subsequent plot
updating.

Linking of plots using interactively modifiable drawing
style attributes of points is another common example of
dynamic graphics. In figure 1 each of the 4 treatment groups
is represented by a different plotting symbol.

In standard implementations of linking [2,12,13], each
plot has one point per case, and all points representing a
case are required to have the same drawing style. This is
also true for points contained in the scatterplot matrix of
figure 1. However, in the lower plots each point represents
the trees (typically 4) in a particular treatment x block
combination. In fact, all three plots are linked. The lower left
plot was constructed specifically so that selecting a
particular treatment x block combination would be easy: using
the painting operation we color the points for the fourth
treatment red, say; then all points representing trees in this
treatment group change to red. Conversely, we could color a
single point in one of the scatterplot matrix panels blue, and
all points representing the same tree change to blue. This
tells us immediately the treatment and block assigned to the
selected tree.

More accurately, a point in the lower plot now represents
3 red trees and 1 blue tree, but because a point is small we
do not allow proportional coloring. (We do however use
proportional coloring for the bars of a histogram or barplot.)
One could draw a point using its majority color (red), and
use blue for a short time following the color change of the
linked point. This way we still obtain the subset membership
information for the changed point.

2.3  Plot  design 

In  the  previous  section  we  noted  that  many  plot  modifica¬ 
tions  in  dynamic  graphics  are  characterized  by  modifying 
a  transformation  applied  to  the  data.  Also,  for  linking  the 
system  needs  to  know  which  point  represents  which  piece  of 
data.  Here  I  outline  an  organisational  scheme  for  statistical 
graphics,  where  the  system  keeps  track  of  the  associations 
between plot and data. More details are given in Hurley and
Oldford [5].

Statistical plots are collections of objects such as points,
lines, labels and axes. These objects may be arranged in a
hierarchy: a scatterplot consists of axes, labels and a
pointcloud, which itself consists of points. Similarly a scatterplot
matrix consists of pointclouds and labels, though arranged
in a different format. An object appearing in the plot has
an associated piece of statistical data: for the scatterplot it's
the entire dataset, for a point it's typically a case, and for
the pointcloud the collection of cases. Each component of a
plot we term a view, so-called because it provides a
graphical representation of some piece of data, called the viewed
object. A view object contains a reference to its viewed
object, and an image of a view is used as a graphical interface
to the viewed object.

The  following  discussion  relies  on  such  a  conceptual 
model  for  statistical  graphics. 

3  A  general  plot-data  interface 

In  this  section,  a  general  plot-data  interface  is  described, 
without  assuming  any  particular  data  representation.  We 
will use the scatterplot as an example.

3.1  Plot  construction 

Suppose  we  invoke  the  following  function: 

    (scatplot :data apple :x "calcium"
              :y '(log "wgt"))

This builds a scatterplot of the apple data with "calcium"
on the x-axis and (log "wgt") on the y-axis. The steps
involved are:

1. Construct a scatterplot with the dataset apple as its
viewed object.

2. The scatterplot then extracts subjects (trees) from the
dataset, constructs a pointcloud and two axes each
viewing the subjects, constructs title and axis labels,
and assigns positions to these views.

3. The pointcloud in turn constructs one point object for
each subject, and positions the points by extracting
"calcium" and (log "wgt") from the subjects.

4. The x-axis computes tic marks and tic labels by
extracting "calcium" coordinates from the subjects; similarly
the y-axis.

There  are  two  stages  where  the  plot  obtains  information  from 
the  data: 

1. In construction of subjects. The scatterplot by default
uses the list-subjects function.

2. For obtaining coordinates from the subjects. The
pointcloud obtains the x-coordinates by applying the
value-of function to each subject with "calcium"
as argument. For the y-coordinate the subjects do not
have a variable called (log "wgt"): this assumes the
value-of function will extract "wgt" and compute the
log.


These steps are easily generalized to arbitrary data
representations using generic functions. A generic function differs
from an ordinary function in that its implementation is
distributed across one or more methods. When a generic
function is invoked, there is an automatic mechanism in place
that chooses a method appropriate to the arguments,
whereupon that method is executed and its values returned; see
for example [6,11]. The implementor of the graphics
package assumes that methods for the generic functions exist.
In order to use the graphics package, the implementor of a
dataset should define appropriate methods for list-subjects
and value-of.
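The division of labour between the graphics package and the dataset implementor can be sketched as follows. This is only an analogue in Python, using `functools.singledispatch` in place of Lisp generic functions; the classes `ArrayDataset` and `Case`, the tuple encoding `("log", "wgt")` for the Lisp form (log "wgt"), and the data values are all assumptions made for illustration.

```python
import math
from functools import singledispatch


class ArrayDataset:
    """Hypothetical dataset: column names plus a list of rows."""

    def __init__(self, names, rows):
        self.names = names
        self.rows = rows


class Case:
    """One row of an ArrayDataset."""

    def __init__(self, dataset, index):
        self.dataset = dataset
        self.index = index


# --- written by the graphics package: generic functions without methods ---
@singledispatch
def list_subjects(dataset):
    """Generic function: the default subjects of a dataset."""
    raise NotImplementedError(f"no list_subjects method for {type(dataset).__name__}")


@singledispatch
def value_of(subject, variable):
    """Generic function: the value of a variable for a subject."""
    raise NotImplementedError(f"no value_of method for {type(subject).__name__}")


# --- written by the dataset implementor: methods for one representation ---
@list_subjects.register(ArrayDataset)
def _(dataset):
    return [Case(dataset, i) for i in range(len(dataset.rows))]


@value_of.register(Case)
def _(subject, variable):
    # A tuple such as ("log", "wgt") stands in for the Lisp form (log "wgt").
    if isinstance(variable, tuple):
        fn_name, inner = variable
        return getattr(math, fn_name)(value_of(subject, inner))
    column = subject.dataset.names.index(variable)
    return subject.dataset.rows[subject.index][column]


apple = ArrayDataset(["calcium", "wgt"], [(0.29, 85.0), (0.35, 70.0)])
subjects = list_subjects(apple)
xs = [value_of(s, "calcium") for s in subjects]
ys = [value_of(s, ("log", "wgt")) for s in subjects]
```

The plotting code calls only `list_subjects` and `value_of`, so a different data representation needs new methods but no change to the plot.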

As demonstrated by figure 1, for a given dataset there are
many possible interpretations of subject. In the lower plots
a subject is a number of trees instead of just one. Note
that any subset of the data may be considered a subject. In
general a subject will be a data item (typically a case) or a
list of data items. Therefore, as in the following example,
we may supply the plot with a function to use to extract the
subjects:

    (scatplot :data apple
              :x 'treat-no
              :y "wgt" :y-function 'mean
              :subj-fn 'treat.block)

•  Subjects are obtained by applying treat.block to the
dataset, rather than the default list-subjects,
yielding a list of subjects, where each subject is a list of
trees corresponding to a particular treatment x block
combination.

•  The x-coordinate is obtained by applying the treat-no
function to the list of trees.

•  The y-coordinate is obtained by extracting "wgt" from
the subjects, then applying the function mean.

This assumes functions treat.block, treat-no and mean
have been appropriately defined.
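A minimal Python sketch of this call, under the assumption that each tree is a simple record, might look as follows. The functions `treat_block` and `treat_no` mirror the names used above but their bodies are guesses at plausible definitions, and `point_coords` is a hypothetical stand-in for the coordinate computation inside the plot.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases: one dict per tree.
trees = [
    {"name": "t1", "treat": 1, "block": 1, "wgt": 80.0},
    {"name": "t2", "treat": 1, "block": 1, "wgt": 90.0},
    {"name": "t3", "treat": 2, "block": 1, "wgt": 60.0},
    {"name": "t4", "treat": 2, "block": 1, "wgt": 70.0},
]


def treat_block(dataset):
    """Subject function: one subject (a list of trees) per treatment x block cell."""
    cells = defaultdict(list)
    for tree in dataset:
        cells[(tree["treat"], tree["block"])].append(tree)
    return [cells[key] for key in sorted(cells)]


def treat_no(subject):
    """x-coordinate: the treatment number shared by the trees in a subject."""
    return subject[0]["treat"]


def point_coords(dataset, subj_fn, x, y, y_function):
    """One (x, y) pair per subject: x applied to the subject, y reduced by y_function."""
    coords = []
    for subject in subj_fn(dataset):
        y_values = [tree[y] for tree in subject]  # extract "wgt" from each tree
        coords.append((x(subject), y_function(y_values)))
    return coords


coords = point_coords(trees, treat_block, treat_no, "wgt", mean)
```

Each resulting point keeps its list of trees as its subject, so operations such as 'identify' can still report every tree in the group.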

The plot system uses additional generic functions to
extract (i) the dataset name (used for the plot title), (ii) a list
of variables (which are used when constructing a menu for
choosing a new variable), (iii) a subject label, and (iv) the
underlying data for inspection. When necessary, the default
generic functions may be overridden by providing additional
arguments to the view constructors.

3.2  Data  transformations 

We extend the plot-data interface to include a series of data
transformations,  so  the  plot  system  rather  than  the  analyst 
looks  after  computation  of  derived  variables,  and  subsequent 
plot  updating. 




The  discussion  in  the  previous  section  suggests  modifying 
the  plot  by  (i)  changing  subjects  and  (ii)  changing  coordi¬ 
nates. 

The subject selection may be modified by deleting or
adding in subjects. Most generally, subjects would be
modified by supplying a new function to extract them from the
dataset. However, this could result in a plot bearing little
relation to the original, and so it is just as easy to make a
new plot.

There are three steps involved in extracting, say, the
x-coordinates from the data:

1. Extract the subject value (or values). The value
extracted is specified by the :x argument, which may
be any legal argument to value-of. In my
implementation using a simple dataset representation the
value-of arguments can be used to extract
functions like (log "wgt"), or linear combinations '(+
"carbon" "wgt"). The :x argument may also be a
function which is then applied to the subjects.

2. Transform the value from step 1 to a single real number.
This transformation is specified by the :x-function
argument (the default is the identity transformation).
The transformation could be log or square root,
assuming step 1 returns a single value, or the mean or median
function when step 1 returns a list of values. Of course,
we could eliminate this step (incorporate it in 1) but this
way is simpler for the analyst.

3. Transform the values from step 2 from $R^n$ to $R^n$, where
n is the number of subjects. The transformation is
specified by the :x-transform argument; again the default
is the identity. This step allows for projections of
variables, common in linear regression. With
transformations defined for the space spanned by a selection of
predictor variables and the orthogonal subspace, one
immediately obtains residual plots and added-variable
plots.
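The three steps above can be sketched as a small function pipeline. The Python names `x_coordinates`, `x`, `x_function` and `x_transform` are hypothetical counterparts of the :x, :x-function and :x-transform arguments, and `center` is just one simple example of a projection in $R^n$ (onto the complement of the constant vector); none of this is the author's implementation.

```python
import math
from statistics import mean


def x_coordinates(subjects, x, x_function=lambda v: v, x_transform=lambda vs: vs):
    """Extract, reduce to one real number per subject, then transform in R^n."""
    extracted = [x(s) for s in subjects]          # step 1 (:x)
    reduced = [x_function(v) for v in extracted]  # step 2 (:x-function)
    return x_transform(reduced)                   # step 3 (:x-transform)


def center(values):
    """An R^n -> R^n map: residuals after removing the overall mean."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical subjects: each subject is the list of weights of its trees.
subjects = [[80.0, 90.0], [60.0, 70.0], [100.0, 110.0]]

raw = x_coordinates(subjects, x=lambda s: s, x_function=mean)
centered = x_coordinates(subjects, x=lambda s: s, x_function=mean,
                         x_transform=center)
logged = x_coordinates([[math.e], [1.0]], x=lambda s: s[0], x_function=math.log)
```

Because each stage is a separate argument, changing any one of them and re-running the pipeline is enough to update the plot coordinates.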

Any of the above arguments may be changed to modify
the plot coordinates, either by a command-style interface
or by selection from a menu. For step 1, the menu offers a choice
of all variables known to the subjects in the plot; for step 2 the
menu offers choices like square root and log, while the menu for
step 3 is empty by default. The user can add other choices
to the menus for steps 1 and 2, or add transforms to the
transformation menus.

3.3 Linking views

Usually we link views in the sense of using common
drawing style attributes (color, shape, size) when displaying a
data item. In the example of figure 1, points whose viewed
objects are identical or have a non-empty intersection are
linked. A generic function eqc is used to compare a pair of
data items to see if their views may be linked. The default
just checks for identity. Linking can be used with
alternative data representations simply by defining the appropriate
method for eqc. For instance, if dataset individuals were
identified by position or by a label, the eqc method could
test for identical positions (labels).
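The linking rule described here can be illustrated with a short Python sketch. Python has no direct equivalent of a two-argument generic function, so the comparison is passed in as a parameter `test`; `eqc_by_label`, `linked`, `paint` and the dictionary layout of views are all hypothetical names introduced for the example.

```python
def eqc(a, b):
    """Default comparison: the two data items are the very same object."""
    return a is b


def eqc_by_label(a, b):
    """Alternative comparison for datasets whose individuals carry labels."""
    return a["label"] == b["label"]


def linked(viewed_a, viewed_b, test=eqc):
    """True when two viewed objects (lists of data items) share an item."""
    return any(test(a, b) for a in viewed_a for b in viewed_b)


def paint(views, source, color, test=eqc):
    """Give every view linked to `source` the drawing colour `color`."""
    for view in views:
        if linked(view["viewed"], source["viewed"], test):
            view["color"] = color


# Hypothetical trees and views: three single-tree points and one group point.
t1, t2, t3 = {"label": "tree-1"}, {"label": "tree-2"}, {"label": "tree-3"}
points = [{"viewed": [t], "color": "black"} for t in (t1, t2, t3)]
group = {"viewed": [t1, t2], "color": "black"}
views = points + [group]

paint(views, group, "red")        # colours the points for t1 and t2, and the group
copy_of_t3 = {"label": "tree-3"}  # same label, but a different object
paint(views, {"viewed": [copy_of_t3]}, "blue", test=eqc_by_label)
```

With the default identity test the copy of the third tree would not link to anything; comparing labels instead lets views built from a different representation of the same individual be painted together.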
Abstract


Despite the recent explosion of software development and
computer programs capable of bringing dynamic visual data
analytic techniques to a wide range of users, little empirical
evidence has been offered to justify or support claims about
the potential usefulness and efficacy of dynamic graphical
data analytic procedures as a class. In the current
investigation, artificial data with a known three-dimensional
"clustery" structure was submitted to a sophisticated data
"visualizer" who attempted to identify the structure in the
data. Additional variables which were taken into account
included (1) the number of true clouds of points present in
each data set, (2) the number of data points per cloud,
and (3) the distance between pairs of clouds within a data
set. Results indicate that distance between clouds relates
positively to the accuracy of cloud membership judgments.


1.0 Purpose


Does the visual exploration of multivariate data using
dynamic three-dimensional spinning scatterplots (which we
call "3D spinplots") provide a sophisticated user with
information and insights about the data he is examining? Do 3D
spinplots let an experienced visualizer identify real structure
which exists in the data? For virtually anyone who has seen
videos of 3D spinplot software such as PRIM-9 (Tukey,
Friedman and Fisherkeller, 1973), Dataviewer (Buja &
Tukey, 1987) or VISUALS (Young & Reinghans, 1991), the
answer would unambiguously be "quite possibly".
Unfortunately, a more definitive answer is not available since these
methods have not been evaluated with data having known
structure.

The current investigation was undertaken to directly address
the question of how well and to what degree 3D spinplots
permit an experienced user to identify real structure which
has been built into sets of artificially generated data.
Specifically, the second author, acting as a subject in this
experiment, attempted to accurately identify the true
cluster-structure of artificial three-dimensional data spaces
generated by the first and third authors. Each data space consisted
of 1 or more trivariate normal point clouds.

A priori, one might expect the success of a subject in this
task to depend on the separation of the true clouds of points,
and on the internal compactness of the true clouds of points.
Perhaps the greater the separation between two clouds
relative to their compactness, the more accurate will be the
identification of the correct cluster-structure. Furthermore,
clouds containing many data points may be more accurately
identified than clouds with few points.


2.0 Design and Methods


Twenty-four trivariate data sets were constructed by the first
and third authors. Data sets contained between one and six
clouds of data points randomly sampled from trivariate
normal distributions with a constant variance-covariance matrix of

$$
\Sigma =
\begin{pmatrix}
4.0 & 1.2 & 1.2 \\
1.2 & 4.0 & 1.2 \\
1.2 & 1.2 & 4.0
\end{pmatrix}.
$$

The centroid of each cloud within a data
set was positioned (within sampling error) at a discrete
location (x, y, z) on a three-dimensional lattice (where
$x \in \{-2.5, 0, 2.5\}$, $y \in \{-2.5, 0, 2.5\}$, and $z \in \{-2.5, 0, 2.5\}$).
Sizes of individual clouds within a data set varied to contain
either 'Large' (approximately 60), 'Medium' (approximately
30), or 'Small' (approximately 10) numbers of data points.
In addition, three data sets contained a supplemental 'Tiny'
cloud of 3 data points, one data set contained an 'X-Large'
cloud of 120 points, and another an 'XX-Large' cloud of 180
points.


Using the VISUALS (Young & Kent, 1987) data
visualization software system on a 22 MHz 80386-based
microcomputer equipped with a 640x480 pixel VGA monitor, the
second author was presented with the task of identifying the
cluster structure of the data points by classifying them into
subjective groups. Initially, the data points appeared as white
dots on a black background with no cues to the true structure
of the data clouds apparent. The subject knew only that each
data set contained between one and six data clouds and that
they were generated by sampling from trivariate normal
distributions. He was not informed of the exact number of
clouds in a data set, their relative sizes, shapes or locations,
or even that the trivariate normal distribution used to
generate all clouds had the same variance-covariance structure.

The  subject  spun  the  three  dimensional  visualization  space 
and  formed  subsets  using  the  VISUALS  software  system. 
He  worked  at  his  own  schedule,  dividing  his  work  into  6  ses¬ 
sions,  spending  about  17  minutes  per  dataset  (range:  2-55 
minutes;  standard  deviation:  12  minutes). 
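For readers who wish to generate comparable data, the design described in this section can be sketched in Python as below. This is not the authors' generator: the seed, the cloud sizes in the example, the rule that clouds in one data set occupy distinct lattice nodes, and the helper names `cholesky`, `sample_cloud` and `make_data_set` are all assumptions. Only the variance-covariance matrix and the lattice coordinates are taken from the text.

```python
import math
import random

# Common variance-covariance matrix and lattice coordinates from the design.
SIGMA = [[4.0, 1.2, 1.2],
         [1.2, 4.0, 1.2],
         [1.2, 1.2, 4.0]]
LATTICE = (-2.5, 0.0, 2.5)


def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_cloud(center, n_points, rng, chol):
    """Draw n_points from N(center, SIGMA) using x = center + L z."""
    points = []
    for _ in range(n_points):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        points.append(tuple(
            center[i] + sum(chol[i][k] * z[k] for k in range(i + 1))
            for i in range(3)))
    return points


def make_data_set(sizes, rng):
    """One cloud per entry of sizes, each centred on a distinct lattice node
    (distinctness is an assumption of this sketch)."""
    nodes = [(x, y, z) for x in LATTICE for y in LATTICE for z in LATTICE]
    centers = rng.sample(nodes, len(sizes))
    chol = cholesky(SIGMA)
    return [sample_cloud(c, n, rng, chol) for c, n in zip(centers, sizes)]


rng = random.Random(1991)  # arbitrary seed, for reproducibility only
data_set = make_data_set([60, 30, 10], rng)  # 'Large', 'Medium', 'Small'
```

Sampling through the Cholesky factor gives every cloud the same covariance structure, so the clouds in a data set differ only in location and in number of points.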


3.0 Data


Figure 1 schematically depicts (in 2 dimensions) a fictitious
data set containing five clouds of points designated by the
circular regions labelled 'A' through 'E'. The sum of the
Arabic numbers within the boundaries of each cloud-circle
denotes how many data points it contains. Let us assume that
these clouds were generated by randomly sampling from
trivariate normal distributions with the common
variance-covariance (or correlation) matrix specified above. In the
figure the subject's classification of the data points into
subsets is indicated by the polygonal regions labelled with
Roman numerals. The Arabic numbers at the intersections of
cloud-circles and subset-polygons specify the number of
data points in a particular cloud which the subject classified
as members of a particular subset. Thus, the subject
classified all 5 data points from cloud 'A' into his subset 'I' along
with 5 out of 12 data points from cloud 'D'. The subject's
subset 'II' contained 4 data points from cloud 'D', 1 from
cloud 'E', and 6 from cloud 'B'. Likewise for the remaining
subsets. The information in Figure 1 can be summarized by a
p (number of clouds) by g (number of judged subsets)
matrix C, which cross-tabulates the number of data points in
each cloud that were placed into each subjective subset:

$$
C =
\begin{array}{c|cccc}
  & I & II & III & IV \\ \hline
A & 5 & 0 & 0 & 0 \\
B & 0 & 6 & 0 & 0 \\
C & 0 & 0 & 2 & 0 \\
D & 5 & 4 & 3 & 0 \\
E & 0 & 1 & 2 & 8
\end{array}
\qquad \text{(EQ 1)}
$$

Note that elements of C indicate the frequency with which
members from different clouds (rows) were judged to be in
the same subset (columns).


4.0 Objective Cloud Measures


Since  VISUALS  displays  objects  in  a  Euclidean  space,  we 
use  measures  based  on  the  relationships  between  data  points 
and data clouds in Euclidean space. Our measures concern
each cloud's compactness and the separation between pairs of
clouds. Note that
these  are  “objective”  measures,  since  they  measure  the 
“true”  characteristics  of  the  point  clouds.  In  the  next  section 
we  define  “subjective”  measures  based  on  the  subject’s  judg¬ 
ments  about  cloud  characteristics. 

Compactness of data cloud k, denoted $\alpha_k$, is defined as
the root-mean-squared distance between the points in the
cloud and the cloud's centroid. If $x_{ik}$ equals the vector of
coordinates for the i'th data point in the k'th data cloud,
$\bar{x}_k$ equals the vector of coordinates of the centroid of the k'th
cloud, and $n_k$ equals the number of data points in the k'th
cloud, then in matrix algebra form:

$$
\alpha_k = \sqrt{\frac{1}{n_k} \sum_{i=1}^{n_k} (x_{ik} - \bar{x}_k)'(x_{ik} - \bar{x}_k)}
\qquad \text{(EQ 2)}
$$

In Figure 1, the $\alpha_k$ for any data cloud is represented by the
radius of its cloud-circle.


Separation of data clouds i and j in the same data set,
denoted $\beta_{ij}$, is the distance between the cloud centroids:

$$
\beta_{ij} = \sqrt{(\bar{x}_i - \bar{x}_j)'(\bar{x}_i - \bar{x}_j)}
\qquad \text{(EQ 3)}
$$

where $\bar{x}_i$ and $\bar{x}_j$ are defined above. In Figure 1, this is the
distance between the centroids of two cloud-circles.
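The two objective measures can be computed directly from the point coordinates. The Python sketch below follows the verbal definitions (root-mean-squared distance to the centroid, and Euclidean distance between centroids); dividing by the number of points $n_k$ rather than $n_k - 1$ is an assumption, and the function names and toy clouds are hypothetical.

```python
import math


def centroid(cloud):
    """Coordinate-wise mean of the points in a cloud."""
    n = len(cloud)
    return tuple(sum(p[d] for p in cloud) / n for d in range(len(cloud[0])))


def compactness(cloud):
    """Root-mean-squared distance from the points of a cloud to its centroid."""
    c = centroid(cloud)
    total = sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in cloud)
    return math.sqrt(total / len(cloud))  # divisor n_k is an assumption


def separation(cloud_i, cloud_j):
    """Euclidean distance between the centroids of two clouds."""
    ci, cj = centroid(cloud_i), centroid(cloud_j)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, cj)))


# Two tiny made-up clouds in three dimensions.
cloud_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # centroid (1, 0, 0)
cloud_b = [(4.0, 4.0, 0.0), (4.0, 4.0, 2.0)]  # centroid (4, 4, 1)

alpha_a = compactness(cloud_a)
beta_ab = separation(cloud_a, cloud_b)
```

For these toy clouds every point lies at distance 1 from its centroid, so the compactness is 1, and the centroids differ by (3, 4, 1), giving a separation of the square root of 26.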


Figure  1 
5.0 Subjective Subset Measures


Each of the measures described above reflects "objective"
characteristics of the data clouds, known only because the
data were artificially generated. However, in order to fully
understand what information the subject took into account as
he formed his subsets, we must also develop "subjective"
measures paralleling the "objective" measures. These
"subjective" measures should reflect the perceived characteristics
and relationships among the clouds.

Our  “subjective”  measures  use  the  relative  frequency  with 
which  pairs  of  points  in  a  judged  subset  are  correctly  or 
incorrectly grouped together. Notice that our indices are not
based on "correctly or incorrectly classified points". This is
because, as judged by the subject, subsets of points cannot be
bound to any prior point classification scheme, making
notions of "correct" and "incorrect" classification
meaningless. However, after judgments are made, we do know
whether  two  points  judged  to  be  in  the  same  subset  actually 
do  or  do  not  belong  to  the  same  “objective”  cloud,  allowing 
us  to  define  correctly  or  incorrectly  co-classified  pairs  of 
points. 

Even  though  our  measures  concern  the  subjective  judg¬ 
ments,  they  are  measures  about  the  objective  clouds:  They 
measure  how  the  objective  clouds  are  subjectively  per¬ 
ceived.  Our  two  measures  are  the  perceived  cohesiveness  of 
each  cloud  (the  subjective  analog  of  compactness),  and  the 
perceived  distinctiveness  of  pairs  of  clouds  (the  subjective 
analog  of  separation). 

Cohesiveness of cloud k, denoted $a_k$, is defined as a
function of the square-root of the frequency with which points in
one cloud are correctly judged to belong with other data
points in the same cloud, summed over all subsets, and
divided by the maximum possible number of correct
co-occurrences of the cloud members. If C is the classification
matrix described above, and M is the diagonal matrix formed
from its row marginals, $M = \mathrm{diag}[C\mathbf{1}]$ ($\mathbf{1}$ being a
column of g ones), then (unadjusted) cohesiveness is:

$$
A = \mathrm{diag}[M^{-1} C C' M^{-1}] \qquad \text{(EQ 4)}
$$

The diagonal matrix A contains the (unadjusted)
cohesiveness $\tilde{a}_i$ of the i'th cloud on its diagonal. We call $\tilde{a}_i$
"unadjusted" because its lower bound, in this study, is the
quantity $c_i$ given in (EQ 5),
since the subject knew that data sets contained at most six
clouds (the upper bound is 1). Thus, our (adjusted)
cohesiveness measure, which ranges between 0 and 1, is:

$$
a_i = \frac{\tilde{a}_i - c_i}{1 - c_i} \qquad \text{(EQ 6)}
$$

Distinctiveness of a pair of clouds i and j, denoted $b_{ij}$, is
defined as a function of the (i,j)'th off-diagonal element of
the matrix $(M^{-1} C C' M^{-1})$ introduced above. In general, the
off-diagonal elements measure the degree to which points
from different clouds are incorrectly paired with one
another, summed over all subsets; they are an index of the
degree to which elements from two clouds are "confused"
with one another. However, these indices are confounded
with the measure of cohesiveness defined above. Thus, we
define the matrix $\tilde{M} = [\mathrm{diag}(CC')]^{1/2}$. Then an index of
confusability of a pair of clouds, corrected for the
cohesiveness of the pair of clouds, is given by the off-diagonal
elements of $\tilde{M}^{-1} C C' \tilde{M}^{-1}$. Since this confusability measure is
inversely related to the distance between two data clouds, we
define the distinctiveness $b_{ij}$ (which is directly related to
distance) as the off-diagonal elements of

$$
B = [\mathbf{1}\mathbf{1}' - \tilde{M}^{-1} C C' \tilde{M}^{-1}] \qquad \text{(EQ 7)}
$$

(here $\mathbf{1}$ is a column of p ones). Note that $b_{ij}$ varies from 0,
for complete perceptual confusion of point-clouds i and j, to 1,
for complete perceptual distinction of the two clouds.
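As a worked illustration, the subjective measures can be computed for the Figure 1 example, whose classification matrix C is described in Section 3.0. The Python sketch below rests on several assumptions, because the printed equations are only partly legible: unadjusted cohesiveness is taken as the diagonal of $M^{-1}CC'M^{-1}$, its lower bound as the value obtained when a cloud is split as evenly as possible over six subsets, the adjustment as a linear rescaling onto [0, 1], and distinctiveness as one minus the cosine between rows of C. All function names are hypothetical.

```python
import math

# Classification matrix of the Figure 1 example:
# rows are clouds A-E, columns are judged subsets I-IV.
C = [[5, 0, 0, 0],
     [0, 6, 0, 0],
     [0, 0, 2, 0],
     [5, 4, 3, 0],
     [0, 1, 2, 8]]


def cross_products(c):
    """The p x p matrix C C'."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in c] for ri in c]


def unadjusted_cohesiveness(c):
    """Diagonal of M^-1 C C' M^-1, with M the diagonal matrix of row totals."""
    return [sum(v * v for v in row) / sum(row) ** 2 for row in c]


def lower_bound(n, max_subsets=6):
    """Smallest unadjusted value when n points are spread as evenly as
    possible over max_subsets subsets (assumed reading of the bound)."""
    w, r = divmod(n, max_subsets)
    return (r * (w + 1) ** 2 + (max_subsets - r) * w ** 2) / n ** 2


def cohesiveness(c):
    """Unadjusted values rescaled linearly from [lower bound, 1] onto [0, 1]."""
    out = []
    for row, a in zip(c, unadjusted_cohesiveness(c)):
        lb = lower_bound(sum(row))
        # A one-point cloud cannot be fragmented; treat it as fully cohesive.
        out.append(1.0 if lb == 1.0 else (a - lb) / (1.0 - lb))
    return out


def distinctiveness(c):
    """One minus the cosine between rows of C (off-diagonal entries matter)."""
    cc = cross_products(c)
    norms = [math.sqrt(cc[i][i]) for i in range(len(c))]
    return [[1.0 - cc[i][j] / (norms[i] * norms[j]) for j in range(len(c))]
            for i in range(len(c))]


a = cohesiveness(C)
b = distinctiveness(C)
```

Under these assumptions cloud A, which was kept entirely in one subset, has cohesiveness 1, while cloud D, split 5/4/3 over three subsets, scores 26/120. Clouds A and B never share a subset, so their distinctiveness is 1, whereas A and D share subset I and are only partly distinguished.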

6.0  Results 


As  stated  earlier,  we  expect  that  the  success  of  the  subject  in 
the  task  posed  by  our  experiment  will  depend  on  the  separa¬ 
tion  of  the  point-clouds  and  on  their  compactness. 

We expected that the "objective" compactness of data
clouds, as measured by $\alpha_k$, would be positively related to the
"subjective" cohesiveness, as measured by $a_k$. However, the
observed correlation between $\alpha_k$ and $a_k$ was essentially zero,
indicating that the size of the point-cloud had no effect on
the accuracy with which the subject grouped points. This
result may be due to the fact that all point-clouds were
generated by sampling from populations possessing the same
variance-covariance structure.


$$
c_i = \frac{r(w+1)^2 + (6-r)\,w^2}{n_i^2},
\quad \text{where } r = \mathrm{Rem}[n_i/6], \; w = \mathrm{Int}[n_i/6]
\qquad \text{(EQ 5)}
$$


We also expected that the "objective" separation of pairs of
data clouds, as quantified by the separation measure $\beta_{ij}$,
would be positively related to the "subjective"
distinctiveness, as measured by the distinctiveness values $b_{ij}$. The
observed correlation between $\beta_{ij}$ and $b_{ij}$ was .57, indicating
that generally, the farther apart point-clouds were positioned,
the more accurately the subject grouped points from within
them. The scatterplot of the relationship between
distinctiveness and separation (Figure 2) reveals a nonlinear
relationship over the range of data we have considered. Closer study
shows that this nonlinearity results from a trend in which
clouds separated by distances greater than 12 were,
essentially, perfectly distinguishable; clouds separated by 8 to 12
distance units show distinctivenesses between .85 and 1
(with 2 exceptions); while clouds separated by less than 8
units reveal a very mildly positive relationship. A three-piece
linear spline has been drawn by eye for emphasis. (Note that
a separation index normalized by compactness correlated .56
with distinctiveness and showed the same scatterplot shape.)

Figure 2 (scatterplot of separation against distinctiveness)


Last, we expected that clouds containing many data points
would be more accurately identified than clouds containing
few points. Both subjective measures, cohesiveness and
distinctiveness, bear on the question of accuracy. Cohesiveness
describes how well data points which belong together are
kept together, while distinctiveness describes how well
points that belong apart are kept apart. The correlation
between cohesiveness, $a_k$, and number of data points per
cloud equaled .24, suggesting a very modest tendency for
large clouds to have more of their observations correctly
grouped together than small clouds. This result must be
viewed as inconclusive, however, due to the potential
correlation induced between cohesiveness and cloud size during
the scaling of $a_k$ onto the unit interval. Finally, no relation
was seen between distinctiveness and cloud size (using the
average number of points in the two clouds under
consideration), indicating that the average size of pairs of data clouds
did not affect the accuracy with which data points from the
two clouds were placed into separate subsets.


We  also  considered  the  effect  of  cloud  size  on  the  relation¬ 
ship  shown  in  Figure  2.  We  discovered  average  cloud  sizes 
of  all  types  (small,  medium,  and  large)  represented  in  all 
regions  of  the  scatterplot.  Thus,  no  matter  how  many  points 
they  contained,  clouds  which  were  close  together  were  not 
distinguished  as  well  as  clouds  which  were  far  apart.  Fur¬ 
thermore,  among  all  clouds  which  were  far  apart,  the  sub¬ 
ject’s  performance  was  excellent,  across  all  cloud  sizes. 


7.0  Discussion 


The results of the current investigation are "tantalizing, but
preliminary". The intuitively appealing notion of a
relationship between the distinctiveness and separation of clouds
appears to have been borne out (albeit in a nonlinear
fashion). No firm conclusions regarding the relationship between
data cloud size and accuracy may be specified, although it
would appear that cloud size, if it does have an effect, may
exert it on a subject's ability to refrain from incorrectly
fragmenting data clouds, rather than on his ability to differentiate
between points belonging to two different clouds.

In the future, studies should first vary the variance-covariance
structure of the populations from which the cloud points are
sampled in order to investigate effects on perceived cloud
cohesiveness. Second, our results provide rough guidelines
for redefining "interesting" ranges of distances to examine in
more detail (at least for the variance-covariance structure we
used). Third, future studies should contain more
heterogeneous mixtures of cloud sizes than we used. Finally, our
"subjective" cloud measures belong to a class of
covariance-type measures, while our "objective" measures are distances.
While comparisons between these two classes of measures
should correctly point out relationships between "subjective"
and "objective" information where such relationships exist,
those relationships will not necessarily follow any "nice"
functional form (cf. Figure 2). Consequently, we are actively
exploring ways to define distance-like "subjective" measures.
1 Abstract

Exploring multivariate data with the grand tour [1]
is a visually exciting way to discover interesting
structure. However, one criticism of this method is that as
dimensionality increases the chances of quickly
discovering views of interest diminish rapidly, because of the
random nature of the grand tour, and the expanding
volume of space.

To  improve  the  chances  of  discovering  interesting 
structure  we  propose  a  method  for  controlling  the  explo¬ 
ration  by  motion  control  and  directing  movement  along 
the  gradient  of  a  projection  pursuit  function. 

The  benefits  of  this  approach  are  two-fold.  Firstly,  it 
provides  a  fast,  powerful  exploratory  data  analysis  tool, 
and  secondly,  it  provides  a  vehicle  for  exploring  and  com¬ 
paring  projection  pursuit  functions. 

2  Introduction 

Suppose  that  we  are  in  a  two-dimensional  world  in 
a  higher-dimensional  universe,  and  suppose  that  despite 
this handicap we are interested in exploring our universe
via  sequences  of  two-dimensional  views.  Essentially  this 
is  the  environment  of  the  grand  tour. 

Our  implementation  of  the  grand  tour,  coined  the  ran¬ 
dom  jump  walk  tour,  is  one  in  which  a  starting  plane  is 
fixed,  an  ending  plane  is  generated  randomly  in  the  p- 
dimensional  data  space,  and  the  jump  walk  tour  path 
is  the  geodesic  interpolation  between  the  two.  When 
the  ending  plane  is  reached  it  becomes  the  new  starting 
plane,  a  new  ending  plane  is  randomly  generated,  and 
the  tour  progresses,  essentially  randomly  walking  on  a 
Grassmann  manifold  in  p-space[2]. 

To  begin  providing  control  in  the  tour  we  first  look 
at  controlling  the  jump  size.  We  define  the  jump  size  to 
be  the  distance  between  the  starting  and  ending  planes, 
that  is,  the  norm  of  the  canonical  angles.  In  a  random 
jump  walk  tour  this  jump  size  fluctuates  randomly,  in  an 
unrestricted  manner.  If  we  provide  control  over  the  jump 
size,  by  restricting  it  to  be  small,  we  keep  the  exploration 
local,  whilst  increasing  the  jump  size  allows  more  global 
movement  over  the  space. 
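The jump size defined above can be sketched numerically. The following is a minimal illustration, assuming each plane is stored as a p x 2 matrix whose columns span it, and assuming random ending planes are drawn by orthonormalizing a Gaussian matrix (the text does not say how the ending planes are generated). The names `canonical_angles`, `jump_size` and `random_plane` are hypothetical.

```python
import numpy as np

def canonical_angles(a, b):
    """Canonical (principal) angles between the planes spanned by the columns of a and b."""
    qa, _ = np.linalg.qr(a)          # orthonormal basis of the first plane
    qb, _ = np.linalg.qr(b)          # orthonormal basis of the second plane
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def jump_size(a, b):
    """Norm of the canonical angles, used here as the distance between two planes."""
    return float(np.linalg.norm(canonical_angles(a, b)))

def random_plane(p, rng):
    """Assumed generator of a random 2-plane in p-space (orthonormalized Gaussian matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((p, 2)))
    return q

rng = np.random.default_rng(0)
start = np.eye(5)[:, :2]             # plane of the first two coordinates
end = random_plane(5, rng)
print("jump size:", jump_size(start, end))
```

Restricting the jump size then amounts to rejecting, or shortening the path towards, any ending plane whose `jump_size` from the current plane exceeds a chosen bound.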

In addition we assign an index of interest to each two-
dimensional view of the data. The meaning of interest
varies somewhat in data analysis. For example, Friedman
and Tukey[3] thought local clumpiness was interesting,
but later Friedman[4] associated interest with non-
normality.

Figure 1: Jump random walk tour with indices of
interest plotted over time.

A  projection  pursuit  function  is  used  to  assign  an  in¬ 
dex  of  interest  to  each  two-dimensional  projection  of 
the  data.  By  choosing  a  smooth  function  the  deriva¬ 
tive  can  be  used  to  determine  the  new  motion  direction, 
as  opposed  to  randomly  generating  directions  in  the  un¬ 
restricted  tour. 

Combining  this  with  jump  size  control  means  that  we 
try  to  direct  the  tour  towards  views  with  higher  indices 
of  interest,  and  thus  hopefully,  views  that  expose  the 
structure  in  the  data. 

3  Projection  Pursuit  Indices 

There  are  four  indices  of  two  basic  types  which  we 
have  currently  implemented;  two  indices  based  on  ex¬ 
pansions,  and  two  indices  based  on  density  estimates. 
For two-dimensional projection pursuit it is usual to
initially sphere the data, either by principal components
or using a robust variance-covariance matrix. Here the
following notation is used:

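\[
\begin{aligned}
z_i &= \text{data vector}, \quad i = 1, \ldots, n \\
\alpha, \beta &= \text{projection vectors} \\
x_{1i} &= \alpha' z_i, \quad x_{2i} = \beta' z_i \\
y_{1i} &= 2\Phi(x_{1i}) - 1, \quad y_{2i} = 2\Phi(x_{2i}) - 1
\end{aligned}
\]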
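The preprocessing described above (sphering by principal components, projecting onto two directions, and mapping through the standard normal cdf) can be sketched as follows. This is an illustration on synthetic data, not the XGobi code; the function names and the test data are assumptions.

```python
import math
import numpy as np

def sphere(data):
    """Sphere the data by principal components: zero mean, identity covariance."""
    centered = data - data.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(centered, rowvar=False))
    return centered @ eigvec / np.sqrt(eigval)

def normal_cdf(x):
    """Standard normal cdf, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def project(z, alpha, beta):
    """Return the projected coordinates x1, x2 and their transforms y = 2*Phi(x) - 1."""
    x1, x2 = z @ alpha, z @ beta
    return x1, x2, 2.0 * normal_cdf(x1) - 1.0, 2.0 * normal_cdf(x2) - 1.0

# Synthetic correlated data (an assumption, used only to exercise the functions)
rng = np.random.default_rng(1)
raw = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))
z = sphere(raw)
alpha, beta = np.eye(4)[:, 0], np.eye(4)[:, 1]
x1, x2, y1, y2 = project(z, alpha, beta)
```

After sphering, every orthonormal pair of projection vectors gives uncorrelated, unit-variance coordinates, so an index reacts only to structure beyond the covariance.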

3.1  Polynomial  Indices 

Friedman’s[4] index is the L2-distance between the
function g, obtained by inverting the empirical density
through a standard normal cdf, and a bivariate uniform
density on [-1,1] x [-1,1]. The empirical density is
expanded by Legendre polynomials as follows:

\[
\begin{aligned}
I(\alpha,\beta) &= \int_{-1}^{1}\int_{-1}^{1} \left\{ g(y_1,y_2) - \tfrac{1}{4} \right\}^2 dy_1\,dy_2 \\
&= \frac{1}{4} \sum_{j=0}^{\infty}\sum_{k=0}^{\infty} (2j+1)(2k+1)\, E^2\{P_j(y_1)P_k(y_2)\} - \frac{1}{4}
\end{aligned}
\]

where

\[
P_0(y) = 1, \quad P_1(y) = y, \quad
P_j(y) = \frac{1}{j}\left[(2j-1)\,y\,P_{j-1}(y) - (j-1)\,P_{j-2}(y)\right]
\]

are the Legendre polynomials.

In response, Hall[5] suggested that Friedman’s index
is not useful for heavy-tailed distributions, by showing
that this index will be infinite if the tails of the
distribution do not decrease at least as fast as exp{-x²/4}.
As an alternative he proposed an index based on the L2
distance of the empirical density, g, from a standard
normal density, with the expansion of the empirical density
obtained by Hermite polynomials. Our bivariate version
of this approach is as follows:


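\[
\begin{aligned}
I(\alpha,\beta) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \{ g(x_1,x_2) - \phi(x_1,x_2) \}^2\, dx_1\,dx_2 \\
&= \sum_{j=0}^{\infty}\sum_{k=0}^{\infty} E^2\{h_j(x_1)h_k(x_2)\} - 2\pi^{-1} E\{h_0(x_1)h_0(x_2)\} + \cdots
\end{aligned}
\]

where

\[ h_i(x) = H_i(x)\phi(x) \]

and

\[
H_0(x) = 1, \quad H_1(x) = x, \quad
H_i(x) = x H_{i-1}(x) - (i-1) H_{i-2}(x)
\]

are the standardized Hermite polynomials[11].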

Both  of  these  indices  are  estimated  by  truncating  the 
sum  at  some  finite  number  (Friedman[4]  suggests  be¬ 
tween  4  and  8  for  his  index),  and  estimating  the  expected 
values  by  sample  means. 
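The truncated estimate of the Legendre-based index can be sketched as follows. This is a minimal illustration, assuming the double sum is truncated at the same order in j and k, and that y1 and y2 already hold the transformed projections lying in [-1, 1]. The function names are hypothetical.

```python
import numpy as np

def legendre_values(y, max_order):
    """Rows P_0(y), ..., P_max_order(y), built from the three-term recurrence."""
    y = np.asarray(y, dtype=float)
    rows = [np.ones_like(y), y.copy()]
    for j in range(2, max_order + 1):
        rows.append(((2 * j - 1) * y * rows[j - 1] - (j - 1) * rows[j - 2]) / j)
    return np.array(rows[: max_order + 1])

def legendre_index(y1, y2, max_order=4):
    """Truncated index: expected values replaced by sample means."""
    p1 = legendre_values(y1, max_order)
    p2 = legendre_values(y2, max_order)
    means = p1 @ p2.T / p1.shape[1]            # estimates of E{P_j(y1) P_k(y2)}
    weights = 2 * np.arange(max_order + 1) + 1  # the factors (2j + 1)
    return float(0.25 * np.sum(np.outer(weights, weights) * means ** 2) - 0.25)
```

Each evaluation costs a fixed number of passes over the n points, which is why an index of this type is of order n.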

3.2  Density  Estimate  Indices 

Friedman and Tukey[3] originally proposed an index
based on a local scale measure multiplied by a local
density estimate designed to search for clumpiness in the
data. We have implemented a bivariate adaptation of
this index based on the L2-norm of a local density
estimate. Because we initially sphered the data, we
disregard the local scale measure, and have:

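\[ I(\alpha,\beta) = \int f^2(x)\,dx \]

where f is estimated by a kernel density estimate, a sum over
the projected data points i = 1, ..., n, with kernel

\[
K(x) = \begin{cases} \cdots & \text{if } x'x < 1 \\ 0 & \text{otherwise} \end{cases}
\]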

The kernel is one that is proposed by Silverman[9] for
bivariate density estimation, because of its differentiability
properties. Normally in calculating a density estimate
one would optimize the window width parameter.
However, this would be impractical to do for each two-
dimensional projection, so we set a value and allow the
user interactive control of it.
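A density-estimate index of this kind can be sketched as below. Two assumptions are made and should be read as such: the kernel is taken to be Silverman's K(x) = 3/pi (1 - x'x)^2 on the unit disc, which is one of the smooth bivariate kernels he discusses, and the integral of the squared density is approximated by averaging the density estimate over the data points. Neither choice is confirmed by the text; the window width `h` stands for the user-controlled parameter.

```python
import numpy as np

def kernel(u_sq):
    """Assumed kernel: 3/pi * (1 - u'u)^2 inside the unit disc, 0 outside."""
    return np.where(u_sq < 1.0, 3.0 / np.pi * (1.0 - u_sq) ** 2, 0.0)

def density_at_points(x, h):
    """Kernel density estimate evaluated at each of the n projected points (n x 2 array)."""
    diff = x[:, None, :] - x[None, :, :]          # all n^2 pairwise differences
    u_sq = np.sum(diff ** 2, axis=2) / h ** 2
    return kernel(u_sq).sum(axis=1) / (len(x) * h ** 2)

def l2_index(x, h):
    """Approximate integral of f^2 by the sample mean of the estimated density."""
    return float(np.mean(density_at_points(x, h)))

rng = np.random.default_rng(0)
spread = rng.standard_normal((100, 2))
clumped = 0.1 * rng.standard_normal((100, 2))
print(l2_index(spread, 0.5), l2_index(clumped, 0.5))
```

The pairwise differences make the cost of each evaluation grow with the square of the number of points, in contrast to the polynomial indices.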

The fourth index we consider is negative entropy,
which is very similar to the Friedman-Tukey index in
that it is based on a local density estimate, as above:

\[ I(\alpha,\beta) = -\int f(x) \log f(x)\,dx \]

This index is discussed by Jones and Sibson[7], and
Huber[6].

4  Implementation 

The  implementation  of  projection  pursuit  is  embed¬ 
ded  into  XGobi,  the  dynamic  graphics  program  under 
development  by  Swayne,  Cook  and  Buja[10].  Figure  2 
gives  an  indication  of  the  setup. 

An  XGobi  window  is  initiated  with  the  data  of  inter¬ 
est.  Tour  mode  is  activated.  The  top  plot  window  has 
the  data  dynamically  touring.  When  ProjPursuit  is  se¬ 
lected  the  bottom  window  pops  up.  In  this  window  the 
projection  index  is  plotted  over  time.  The  current  value 
of  the  index  is  also  printed  in  a  small  window  beside  the 
projection  pursuit  button. 

In  addition  the  data  is  sphered  by  principal  compo¬ 
nents,  as  indicated  by  the  PrnCmp  Basis  button  being 
highlighted,  and  the  variable  labels  become  PCI,  PC2, 
etc. 

Projection  pursuit  can  be  either  Active  or  Passive. 
In  active  mode  the  direction  of  movement  is  determined 
by  the  derivatives  of  the  projection  pursuit  function 
whilst  in  passive  mode  the  tour  reverts  to  the  random 
jump  walk  but  we  still  get  the  indices  plotted  over  time. 

Beside the Active button is one labelled Bitmap.
Clicking on this generates a small picture in the bottom
window of the view in the top window at the time. As
we can see, the view corresponding to the first local
maximum is one in which three distinct clusters within the
data are separated.

Figure 2: Implementation of direction and motion
control

The scrollbar below the Active button allows the user to
control the tour jump size during active projection pursuit.
The effects of this are seen in the bottom window.
Every time a direction change is made a small vertical bar
is drawn on the horizontal axis. When the jump size is
small the tour sharpens up the nearest local maximum,
and when it is large the tour moves globally over the
Grassmann manifold in p-space.
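The interplay between derivative-based direction and jump size can be sketched as follows. This is a simplified illustration and not the XGobi implementation: the gradient is taken by finite differences, the move is a straight step followed by re-orthonormalization rather than a geodesic interpolation, the accept-or-halve rule is an added safeguard, and `kurtosis_index` is a toy smooth index chosen only for the demonstration.

```python
import numpy as np

def kurtosis_index(x):
    """Toy smooth index: squared excess kurtosis summed over the two coordinates."""
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    return float(np.sum((np.mean(x ** 4, axis=0) - 3.0) ** 2))

def numerical_gradient(index, z, basis, eps=1e-5):
    """Central-difference derivative of the index with respect to each basis entry."""
    grad = np.zeros_like(basis)
    for r in range(basis.shape[0]):
        for c in range(basis.shape[1]):
            bump = np.zeros_like(basis)
            bump[r, c] = eps
            grad[r, c] = (index(z @ (basis + bump)) - index(z @ (basis - bump))) / (2 * eps)
    return grad

def guided_step(index, z, basis, jump_size):
    """Move the plane a distance controlled by jump_size along the index gradient."""
    grad = numerical_gradient(index, z, basis)
    grad -= basis @ (basis.T @ grad)        # drop the part that stays inside the plane
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return basis
    target, _ = np.linalg.qr(basis + jump_size * grad / norm)
    return target

def active_pursuit(index, z, basis, jump_size, steps=40):
    """Repeat guided steps, halving the jump size whenever the index fails to rise."""
    history = [index(z @ basis)]
    for _ in range(steps):
        candidate = guided_step(index, z, basis, jump_size)
        value = index(z @ candidate)
        if value > history[-1]:
            basis = candidate
            history.append(value)
        else:
            jump_size /= 2.0
            history.append(history[-1])
    return basis, history

# Synthetic data with one bimodal variable among five (an assumption for the demo)
rng = np.random.default_rng(0)
z = rng.standard_normal((300, 5))
z[:, 0] = np.where(rng.random(300) < 0.5, -2.0, 2.0) + 0.3 * rng.standard_normal(300)
start = np.eye(5)[:, 3:5]
final, history = active_pursuit(kurtosis_index, z, start, jump_size=0.2)
print(history[0], history[-1])
```

A small `jump_size` keeps the search close to the nearest local maximum, while a large one lets a single step carry the plane far across the space.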

The  next  scrollbar  controls  the  number  of  terms  in  the 
expansion  of  the  polynomial  indices,  and  also  allows  the 
window  width  to  be  adjusted  for  the  density  estimates. 

Lastly  there  is  a  menu  for  selecting  a  projection  pur¬ 
suit  index. 

5  Examples 

In  this  section  we  are  looking  at  the  uses  of  this 
methodology  in  exploring  data,  and  comparing  projec¬ 
tion  indices.  For  this  purpose  we  show  window  dumps 
in  figures  3,  4  and  5  of  the  bottom  time  series  window 
of  the  progress  of  the  tour  guided  by  projection  pursuit 
over time. Every time a new gradient is calculated a bar
is drawn on the horizontal axis. Keep in mind that in
the  setup  on  a  workstation  this  is  happening  dynami¬ 
cally,  and  we  see  the  data  touring  simultaneously  in  the 
top  window. 

The  data  in  figures  3  and  4  comes  from  Lubischew[8]. 
It  consists  of  6  measurements  on  3  species  of  flea-beetles, 
with  a  total  of  74  cases.  The  window  dumps  in  figure 
3 are brief sessions with Friedman’s index and Hall’s
index, respectively. Figure 4 has window dumps of the
Friedman-Tukey  and  entropy  indices. 


Figure  3:  Comparison  of  the  polynomial  indices  on 
flea-beetle  data. 

In  each  case  the  starting  point  is  the  view  given  by  the 
first  two  principal  components.  We  see  that  this  is  not  a 
very  discriminating  view  of  the  three  species.  However 
upon  activating  projection  pursuit  guidance  we  see  that 
all  the  indices  immediately  move  the  tour  into  a  view 
of  a  three  group  separation,  so  with  very  little  work  we 
have  discovered  a  very  informative  picture  of  the  data. 

In figure 5, a comparison of the polynomial indices is
illustrated on a nine-dimensional hypercube. The views
that distinguish the hypercube based on two-dimensional
projections are ones where the data collapse into the four
vertices of a square. During an unrestrained random
jump walk tour, most views of the hypercube appear
close to being bivariate normal, aside from the interference
patterns. It is virtually impossible to see a complete
collapse into the four point view.

So  it  is  interesting  that  starting  at  an  arbitrary  view, 
in  this  data,  the  projection  pursuit  directed  tour  very 
quickly  finds  a  view  of  the  data  collapsed  into  a  square, 
for  both  of  the  two  polynomial  indices.  The  one  differ¬ 
ence  between  the  two  indices  is  that  Hall’s  index  seems 
to  need  to  do  less  work  to  find  this  four  point  view, 
whilst  Friedman’s  index  needs  to  wend  its  way  through 
some lower level maxima. (The apparent planar
non-equivariance of Hall’s index is due to the truncation of
the infinite sum.)

6  Conclusions 

We have found that the polynomial indices show the
most promise due to their speed of computation. The
computation of these indices is of order n, as opposed
to order n² for the density estimate indices. In practice,
Friedman’s index doesn’t appear to be overly sensitive
to outliers and heavy tails as Hall suggested, but this
doesn’t mean that both indices behave identically, as
we saw in the last section. However, other than non-
normality it is difficult to quantify what it is that the
indices are actually searching for.

Figure 4: Comparison of the density estimation
indices on flea-beetle data.

Quite readily, beginning at the view given by the first
two principal components, our guidance system finds
interesting structure, but the inherent optimization
problems with noisy functions arise. Using simple derivative-
based direction control doesn’t assist in finding tight
local peaks and creates problems when the function
consists of long trenches or ridges. These structures tend
to be more common as dimensionality increases. We
counter some of the problems by switching to passive
mode and allowing the random jump walk tour to move
over the space before beginning active projection pursuit
again.

Whilst  it  is  simple  to  sphere  the  data  by  principal 
components,  it  is  not  ideal,  and  so  our  next  question 
will  be  to  explore  these  methods  with  robust  sphering. 

Despite  the  problems  we  have  encountered,  we  have 
devised  a  tool  which  readily  allows  a  comparison  and 
development  of  indices,  as  well  as  providing  direction 
and  motion  control  in  the  grand  tour  to  increase  the 
chances  of  discovering  structure  when  exploring  data. 
Abstract

In  this  paper  we  develop  an  approach  for  creating  a 
graphical  user  interface  (GUI)  for  an  existing 
command  driven  mainframe  program.  As  an  exam¬ 
ple  of  this  approach,  we  present  an  implementation 
using  the  SCA  Statistical  System.  The  windows 
front-end  to  the  SCA  System  runs  on  a  personal 
computer  using  Microsoft  Windows.  The  front-end 
communicates  with  the  SCA  System  running  on  a 
mainframe  or  workstation  through  a  serial  commu¬ 
nication  device.  This  implementation  demonstrates 
the  advantage  of  using  such  an  approach. 

1.  Introduction 

A  general  goal  in  the  computer  industry  has  been 
towards  making  computers  and  software  more  user 
friendly.  A  recent  trend  towards  reaching  this  goal 
is  through  the  use  of  a  graphical  user  interface 
(GUI).  A  common  problem  with  GUIs  is  that  they 
either  favor  novice  users  over  experts,  or  favor 
expert  users  over  novices.  In  this  paper,  we  present 
an  approach  for  combining  command  and  graphical 
user  interfaces  into  a  single  user  interface  that 
benefits  both  novice  and  expert  users.  We  will  refer 
to  the  user  interface  presented  in  this  paper  as  the 
composite user interface (CUI).

We  have  implemented  this  approach  for  the  SCA 
Statistical  System  and  later  we  will  discuss  certain 
features  that  have  resulted  from  our  design 
approach.  The  SCA  Statistical  System  consists  of 
three  packages  which  provide  capabilities  for  fore¬ 
casting  and  time  series  analysis  (Liu  et  al.  1986), 
quality and productivity improvement using statistical
methods (Liu et al. 1987), and general statistical
analysis (Hudak et al. 1989). The graphical
front-end  to  the  SCA  System  runs  on  a  personal 
computer  using  Microsoft  Windows.  The  front-end 
communicates  with  the  SCA  System  running  on  a 
host  computer  through  a  serial  communication 
device.  When  the  mainframe  SCA  System  and  the 
SCA  Windows/Graphics  Package  (Liu  et  al.  1991)  are 
used  together,  a  complete  windowing  environment  is 


created  for  the  user  without  any  modification  to  the 
existing  SCA  System. 

2.  Design  and  Philosophy  of  the  Composite 
User  Interface 

A  primary  goal  of  our  user  interface  design  is  to 
allow  graphical  and  command  user  interfaces  to  co¬ 
exist  in  the  same  application  software.  In  addition, 
we want the user interface portion of the software to
be  as  independent  of  the  computational  portion  of 
the  software  as  possible,  and  ideally  the  same  user 
interface  program  is  able  to  function  with  different 
versions  of  the  computational  portion  of  the  program 
for  different  computers  and  operating  systems. 
These  are  the  key  emphases  of  “system-indepen¬ 
dence”  in  our  user  interface  design.  In  trying  to 
fulfill  these  goals  we  have  developed  a  hybrid  user 
interface,  that  we  refer  to  as  the  composite  user 
interface.  Below  we  outline  the  basic  features  that 
comprise  the  composite  user  interface  presented  in 
this  paper: 

1.  The  functionality  of  the  program  is  independent 
from  the  user  interface.  To  facilitate  this  sepa¬ 
ration,  we  use  two  separate  programs:  a  computa¬ 
tional  program  and  a  front-end  program.  We 
will  refer  to  this  separation  of  the  front-end 
program  from  the  computational  program  as  the 
segmented  feature  of  the  interface. 

2.  The  computational  program  provides  a  command 
user  interface  which  is  the  same  across  different 
host  operating  systems. 

3.  The  front-end  program  provides  a  graphical  user 
interface  which  conforms  to  a  native  GUI  envi¬ 
ronment.  The  user  accesses  the  computational 
program  through  the  use  of  dialog  boxes,  and 
other  GUI  devices.  These  graphical  objects  will 
generate  syntactically  correct  commands  for  the 
computational  program. 

4.  The front-end program should also provide a
command window which preserves a command
user interface to the computational program and
bypasses the graphical objects of the front-end
program. We will refer to the support of
graphical and command user interfaces in one
program as the dual feature of the interface.

5.  The user should be able to customize the
environment and use both graphical objects and
commands interchangeably in the same session.

The  primary  purpose  of  the  composite  user  interface 
is  to  support  an  interface  in  which  graphical  and 
command  user  interfaces  can  be  integrated.  Such  a 
dual  feature  is  very  desirable.  We  envision  that 
software  development  is  an  evolutionary  process,  and 
the  extension  of  graphical  user  interfaces  is  part  of 
this process. Since the current base of users has
already  made  an  extensive  investment  in  the  current 
command  language,  we  see  the  need  to  preserve  the 
command  language  as  it  currently  exists. 

The need for a segmented approach is also driven
by the requirements of system independence,
modularity, and portability. The most plausible
approach to implement this feature is to develop the
front-end interface as a separate program.

The  approach  outlined  above  will  result  in  more 
portable  code,  since  the  functionality  of  the  program 
does  not  depend  on  the  GUI.  Also  since  command 
programs  are  usually  less  machine  dependent,  it 
should  be  fairly  easy  to  move  the  command  program 
to  new  environments.  In  addition,  if  the  new  envi¬ 
ronment  does  not  support  a  GUI,  the  command  pro¬ 
gram  is  still  a  viable  program  in  its  own  right. 

3.  An  Implementation  Using  the  SCA  System 

In  general,  the  approach  outlined  in  Section  2  can  be 
applied  to  any  existing  command  program.  In  this 
section,  we  illustrate  this  approach  using  the  SCA 
Statistical  System.  In  this  implementation,  the  front- 
end  program  runs  on  an  IBM  compatible  personal 
computer  running  Microsoft  Windows.  We  have 
tested  the  front-end  program  with  the  SCA  System 
which runs on IBM/TSO, IBM/CMS, VAX/VMS, or
UNIX operating systems. This software is currently
available from SCA as the SCA Windows/Graphics
Package (Liu et al. 1991), which we will also refer to
as the SCAWIN program in this paper.

3.1  The  SCA  Statistical  System 

Even though the proposed approach can theoretically
be employed in any command driven software, its
effectiveness and implementation depend on the
command structure of the software. If the software
has a rather simple command structure, the GUI will
be of less benefit. The SCA System has a fairly
extensive syntactical structure, which includes
modifiers to each command. This makes a GUI
extremely beneficial to users who are not familiar
with all the features of the SCA System. In this
section, we employ a set of data from Box, Hunter,
and Hunter (1978) to illustrate the command user
interface of the SCA System. The data employed
were the results of a chemical experiment. In this
experiment, it was believed that the initial rate of the
formation of a chemical impurity causing a discoloration
is linearly dependent on the concentrations of
monomer and dimer. The rate is zero when both
components are absent. The data are entered directly
and stored in the SCA workspace in the variables
IMPURITY, MONOMER, and DIMER. A regression
analysis with zero intercept (i.e. no constant
term) is then performed. The SCA commands to
perform the above analysis are listed below:

INPUT VARIABLES ARE IMPURITY, MONOMER, DIMER.
5.75  0.34  0.73
4.79  0.34  0.73
5.44  0.58  0.69
9.09  1.26  0.97
8.59  1.26  0.97
5.09  1.82  0.46
END OF DATA

REGRESS VARIABLES ARE IMPURITY, MONOMER, DIMER. @
NO CONSTANT.

STOP

In  the  above  SCA  session,  we  have  executed  three 
SCA  commands:  INPUT,  REGRESS,  and  STOP.  The 
function  and  syntax  of  these  commands  are  illus¬ 
trative  of  SCA  command  syntax.  An  SCA  command 
is  also  referred  to  as  a  paragraph.  The  first  word  of 
the  command  is  called  the  paragraph  name.  In  this 
example, INPUT, REGRESS, and STOP are paragraph
names. The paragraph name is followed by
various modifiers to the command; these modifiers are
referred to as sentences. In the REGRESS paragraph
there are two sentences: “VARIABLES ...”
and  “NO  CONSTANT”.  Notice  that  sentences  are 
separated  by  the  delimiter  period  (“.”). 
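The paragraph and sentence structure just described can be illustrated with a small parser. This is a sketch for illustration only and is not part of the SCA System: it assumes that a sentence-ending period is followed by white space or the end of the command, so that decimal points inside numbers are left alone, and `parse_paragraph` is a hypothetical name.

```python
import re

def parse_paragraph(command):
    """Split an SCA-style command into (paragraph name, list of sentences)."""
    pieces = [p.strip() for p in re.split(r"\.(?:\s+|$)", command.strip())]
    pieces = [p for p in pieces if p]
    if not pieces:
        raise ValueError("empty command")
    # The first word is the paragraph name; the rest of that piece is a sentence.
    name, _, first_sentence = pieces[0].partition(" ")
    first_sentence = first_sentence.strip()
    sentences = ([first_sentence] if first_sentence else []) + pieces[1:]
    return name, sentences

name, sentences = parse_paragraph(
    "REGRESS VARIABLES ARE IMPURITY, MONOMER, DIMER. NO CONSTANT."
)
print(name)       # paragraph name
print(sentences)  # list of sentences
```

A front-end that can split and reassemble commands in this way can treat typed commands and dialog-generated commands uniformly.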

3.2  The  SCA  Windows/Graphics  Program 

Here  we  outline  the  features  of  the  SCAWIN  pro¬ 
gram.  To  start  the  SCAWIN  program,  the  user  must 
first  login  to  the  host  computer  using  the  terminal 
emulator  window  included  with  the  SCAWIN  pro¬ 
gram.  After  the  user  executes  the  SCA  System,  the 
user  may  enter  the  SCA  commands  and  data  (shown 
in Section 3.1) into the Command Window. Below is
a  sample  screen  for  the  analysis  discussed  in  the 
above  section. 
[Screen display showing the SCA Windows Menu, the SCA Output
Window, and the SCA Command Window]


The above display contains three windows. They are
the SCA Windows Menu, the SCA Output Window,
and the SCA Command Window.

(A)  Command  Window 

The  composite  user  interface  described  in  this  paper 
requires  that  a  user  be  able  to  access  a  command 
program  in  its  native  language.  The  SCA  Command 
Window  allows  the  user  to  communicate  with  the 
SCA  System  in  this  manner.  The  Command  Window 
also  maintains  a  complete  history  of  all  SCA  com¬ 
mands  issued  during  an  SCA  session.  The  commands 
in  this  window  can  be  edited  and  then  executed. 


(B)  Output Window

The Output Window contains all the output from an
SCA  session.  All  information  (i.e.  text)  in  the 
Output  Window  can  be  reviewed  at  any  time. 


(C)  SCA  Windows  Menu 

One  of  the  most  important  features  in  the  SCAWIN 
program  is  the  ability  to  access  the  command  lan¬ 
guage  using  graphical  objects.  To  accomplish  this, 
we  have  implemented  pull-down  menus  to  allow 
users  to  access  all  SCA  paragraphs.  The  SCA  Win¬ 
dows  Menu,  shown  below,  allows  the  user  to  create 
SCA  commands  through  dialog  boxes  and  also 
display  help  information  for  SCA  paragraphs. 


[Figure: menu bar of the SCA Windows Menu]


Ten  menu  items  are  displayed  on  the  SCA  menu  bar. 
Each of these items activates a pull-down menu.


When the user selects a menu, a pull-down menu
is generated. For example, suppose a user wishes to
perform a regression analysis. The user would select
the “GSA” item on the SCA Windows Menu and
then select the “Regression Analysis...” item.


Box-Cox Transformation...
Correlation...
Cross Tabulation...
1-way or 2-way Table...
2-sample t-test...
1-way ANOVA...
2-way ANOVA...
N-way ANOVA...
ANOVA Design Matrix...
Distribution Simulation...
Non-parametric Statistics...


When this item is selected, a dialog box is displayed to
assist the user in creating a command. In the SCAWIN
program, this dialog box is referred to as the
Command Builder. Below we describe the use of the
Command Builder.


(D)  Command  Builder  Window 

The  Command  Builder  Window  is  designed  to 
facilitate  the  construction  and  entry  of  any  SCA 
paragraphs.  As  an  illustration,  we  show  the  short 
Command  Builder  Window  for  the  REGRESS 
paragraph: 


SCA Command Builder: REGRESS

Output and input variables:
Constant term in model?:     Default is YES
Hold result(s) in:           RESIDUALS(res),FITTED()
Other options:

[ OK ]  [ Cancel ]  [ Help ]


To create an SCA command, the user first enters
information into one or more of the controls. (In
this instance, a control refers to one of the text input
boxes in the above dialog box.) Even though many
controls may be displayed, the user does not need to
enter information for all controls. Information only
needs to be provided for those controls that correspond
to required SCA sentences (required sentences
were discussed in Section 3.1 and are signified by
having the prompt underlined and in red color).
When the user has completed entering the information,
he may select the “OK” button or press the
Enter key. The command will then be created and
sent to the SCA System. The SCA command created
is also displayed in the Command Window.

The  Command  Builder  does  not  include  instruc¬ 
tions  for  every  sentence  when  it  creates  an  SCA 
command.  Several  keywords  have  been  employed  to 
indicate  those  sentences  that  need  not  be  processed. 
These  keywords  are  “None”,  “All”,  “Default”,  and 
“e.g.”.  If  an  argument  appearing  in  the  control 
begins  with  one  of  these  keywords,  then  the  Com¬ 
mand  Builder  will  use  the  SCA  System’s  default 
argument  for  that  sentence.  The  keywords  defined 
above  have  been  employed  to  provide  the  user  with 
the  default  options  or  serve  as  sample  information. 
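The keyword rule described above can be sketched in a few lines. This is an illustration and not the SCAWIN code: each control is modelled as a pair of a sentence keyword and the text currently shown in the box, and the sentence keywords used in the example (`SPAN IS`, `HOLD`) are hypothetical stand-ins for whatever the real dialog emits.

```python
SKIP_KEYWORDS = ("none", "all", "default", "e.g.")

def build_command(paragraph, controls):
    """Assemble a command, omitting controls left blank or still showing a placeholder."""
    sentences = []
    for keyword, text in controls:
        text = text.strip()
        # A control that begins with one of the keywords is left to the system default.
        if not text or text.lower().startswith(SKIP_KEYWORDS):
            continue
        sentences.append(f"{keyword} {text}".strip())
    body = ". ".join(sentences)
    return f"{paragraph} {body}." if body else paragraph

controls = [
    ("VARIABLES ARE", "IMPURITY, MONOMER, DIMER"),
    ("SPAN IS", "All"),                       # placeholder, skipped
    ("HOLD", "e.g. RESIDUALS(res),FITTED()"),  # sample text, skipped
]
print(build_command("REGRESS", controls))
```

Because untouched controls contribute nothing, the generated command stays as short as one a user would type by hand.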

Another feature that has been implemented is the
ability for the user to control the number of options
presented in the Command Builder Window. If the
user requests a full Command Builder, then more
optional sentences will be provided for the
REGRESS paragraph as shown below:


SCA Command Builder: REGRESS

Output and input variables:
Constant term in model?:            Default is YES
Span of cases to use:               All
Level for diagnostic statistics:    None
Compute Durbin-Watson statistic?:   Default is NO
Display fitted values?:             Default is NO
ANOVA tables to display:            Default is SEQUENTIAL
Output options:                     e.g. PRINT(RCORR,...
Hold result(s) in:                  RESIDUALS(res),FITTED()
Other options:

[ OK ]  [ Cancel ]  [ Help ]


(E)  High  Resolution  Graphics 

The mainframe SCA System does not have high
resolution graphics capabilities of its own; this is due
to the highly machine dependent requirements of
high  resolution  graphics.  Using  the  composite  user 
interface,  we  are  able  to  implement  graphics  without 
having  to  change  the  existing  SCA  program.  This  is 
achieved  by  capturing  the  data  on  the  PC  and  then 
displaying  the  graphics  in  the  Graphics  Window. 

4.  Summary  and  Conclusion 

In  this  paper  we  have  outlined  a  composite  user 
interface  for  creating  a  GUI  for  an  existing  com¬ 
mand  program,  without  having  to  modify  the  ex¬ 
isting  command  program.  The  same  front-end  GUI 
program  will  work  with  the  command  program 
under  different  computers  or  operating  systems.  We 


have  also  stressed  the  need  of  the  co-existence  of 
command  and  graphical  user  interfaces.  By  using  an 
interface  with  a  dual  feature,  we  have  demonstrated 
an  approach  that  will  have  benefits  to  different 
levels  of  users. 

The  approach  presented  in  this  paper  not  only 
provides  benefits  to  users,  but  also  to  software 
developers.  Software  developers  do  not  need  to 
rework  their  existing  programs,  but  instead  can 
concentrate  on  the  GUI  front-end.  By  retaining  the 
native  command  language  and  minimizing  the 
changes to the existing code, costs for documentation
and future software maintenance are reduced
(Boehm and Papaccio 1988). The composite user
interface has proven to be a cost effective
means for both users and software developers for
migrating  from  character-oriented  to  graphical 
environments. 
Mankind is returning fossil fuel generated CO2 to
Earth’s atmosphere at an exponential rate, causing
concern about a greenhouse warming. Jones et al. (1986)
derived the record of yearly average temperature changes
plotted in Fig. 1. The least squares straight line has
slope 0.38 ± 0.04 °C per century, but the average slope
since 1970 has been much greater and is thought by some
to indicate the onset of the greenhouse.

In  Fig.  2  the  circles  represent  annual  global  totals  of 
fossil  fuel  production  for  1870-1986  [Boden  (1988)].  The 
dashed  curve  is  a  nonlinear  least  squares  fit  of  the  model 


\[ \frac{dP}{dt} = \alpha P, \quad P(0) = P_0 \;\Longrightarrow\; P(t) = P_0 \exp(\alpha t). \]


The  fitting  parameters  together  with  the  sum  of  squared 
residuals  (SSR)  are  given  in  the  first  row  of  the  table  on 
the  next  page. 
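A nonlinear least squares fit of the exponential model can be sketched as follows. This is an illustration on synthetic data, not the production record used in the paper: the Gauss-Newton iteration, the log-linear starting values, and the name `fit_exponential` are assumptions, and the paper does not say which fitting algorithm was used.

```python
import numpy as np

def fit_exponential(t, p, iterations=20):
    """Fit p ~ P0 * exp(alpha * t) by Gauss-Newton; returns (P0, alpha, SSR)."""
    slope, intercept = np.polyfit(t, np.log(p), 1)   # log-linear starting values
    p0, alpha = np.exp(intercept), slope
    for _ in range(iterations):
        growth = np.exp(alpha * t)
        residual = p - p0 * growth
        jacobian = np.column_stack([growth, p0 * t * growth])
        step, *_ = np.linalg.lstsq(jacobian, residual, rcond=None)
        p0, alpha = p0 + step[0], alpha + step[1]
    ssr = float(np.sum((p - p0 * np.exp(alpha * t)) ** 2))
    return float(p0), float(alpha), ssr

# Synthetic series with 2% multiplicative noise (not real production data)
rng = np.random.default_rng(0)
t = np.arange(117.0)                                   # years since the start of the record
p = 2.0 * np.exp(0.04 * t) * (1.0 + 0.02 * rng.standard_normal(t.size))
print(fit_exponential(t, p))
```

Fitting in the original scale, rather than regressing log P on t, weights the large recent values most heavily, which is the sense in which the sum of squared residuals reported in the table is minimized.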

Rust and Kirk (1982) showed that for 1870-1974, the
exponential growth of fossil fuel production was
modulated inversely by Northern Hemisphere temperature
variations. If P(t) is fossil fuel production at year t and
T(t) is temperature, then their model is written

\[ \frac{dP}{dt} = \left( \alpha - \beta \frac{dT}{dt} \right) P, \qquad P(0) = P_0, \]

where β is an additional fitting parameter. Using an
earlier, cruder temperature record, they obtained the values
given in the second row of the table.

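A more realistic model, allowing for time lags between
temperature changes and the corresponding responses in
production, can be written

$$\frac{dP}{dt} = \left[\alpha - \beta \int_{-\infty}^{t} w(t' - t, \tau)\,\frac{dT}{dt'}(t')\,dt'\right] P, \qquad P(0) = P_0,$$

where $t'' = t' - t$ is the time lag, $w(t'', \tau)$ is a memory
function satisfying

$$w(t'', \tau) \ge 0, \quad -\infty < t'' \le 0, \qquad \int_{-\infty}^{0} w(t'', \tau)\,dt'' = 1,$$

and $\tau$ is a parameter measuring the rate at which $w(t'', \tau)$
tapers to zero for decreasing values of the time lag.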

One way to specify $w(t'', \tau)$ is to assume a functional
form in which $\tau$ becomes the fourth fitting parameter.
We tried the following: the boxcar function,

$$w(t'', \tau) = \frac{1}{\tau}, \qquad -\tau \le t'' \le 0,$$

the triangle function,

$$w(t'', \tau) = \frac{2}{\tau}\left(1 + \frac{t''}{\tau}\right), \qquad -\tau \le t'' \le 0,$$

and  the  half-Gaussian, 


wit",  r)  =  — =  exp 


-00  <  t"  <  0  . 


We calculated the convolution integrals numerically, using
for $T(t)$ a cubic interpolating spline representation
of the temperature data (shown as the curve connecting
the points in Fig. 1). The fitted parameter values
are given in rows 3, 4 and 5 of the table, and the
corresponding memory function estimates are plotted in Fig.
3. The half-Gaussian window gave the best fit, with an
estimated $P(t)$ very similar to the solid curve in Fig. 2.
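The window shapes and the unit-integral restriction can be checked with a short sketch. The boxcar and triangle forms follow from the requirement that each window integrate to one over its support. The half-Gaussian parametrization below (with $\tau$ playing the role of a standard deviation) is an assumption made for this example, as are the trapezoid rule and all function names.

```python
import math

def boxcar(lag, tau):
    """Boxcar memory function: constant 1/tau on -tau <= lag <= 0."""
    return 1.0 / tau if -tau <= lag <= 0 else 0.0

def triangle(lag, tau):
    """Triangle memory function: 0 at lag = -tau, rising linearly to 2/tau at lag = 0."""
    return (2.0 / tau) * (1.0 + lag / tau) if -tau <= lag <= 0 else 0.0

def half_gaussian(lag, tau):
    """Half-Gaussian on lag <= 0. ASSUMPTION: tau is the Gaussian standard deviation."""
    if lag > 0:
        return 0.0
    return math.sqrt(2.0 / math.pi) / tau * math.exp(-0.5 * (lag / tau) ** 2)

def trapezoid(func, a, b, steps=20000):
    """Composite trapezoid rule for the integral of func over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for i in range(1, steps):
        total += func(a + i * h)
    return total * h

def lagged_average(signal, window, tau, t, lower):
    """Integral over lags in [lower, 0] of window(lag, tau) * signal(t + lag)."""
    return trapezoid(lambda lag: window(lag, tau) * signal(t + lag), lower, 0.0)
```

In this sketch `signal` would be the time derivative of the interpolated temperature record, and `lower` truncates the half-Gaussian tail at a few multiples of $\tau$.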

Another way to specify $w(t'', \tau)$ is to estimate it from
the data. We did this by assuming that $w(t''; \tau) = 0$ for
all lags with magnitude greater than $\tau = n + 1$, where $n$
is a prespecified integer, and approximating the convolution
integral by numerical quadrature, i.e.,

$$\frac{dP}{dt} = \left[\alpha - \beta \sum_{j=0}^{n} \omega_j w_j\,T'(t - j)\right] P,$$

where $T'$ denotes $dT/dt$, the $\omega_j$ are quadrature coefficients,
and $w_j = w(-j, \tau)$ are discrete values of the memory function to
be estimated. The solution of this ODE can be written

$$P(t) = P_0 \exp\left\{\alpha t - \sum_{j=0}^{n} \beta_j\,[T(t - j) - T(-j)]\right\},$$
Model                   P0 [megatons]   α [yr^-1]   β       τ       SSR

Simple Exponential      157             0.0313                      64.2 × 10^5
Rust and Kirk (1982)    181                         1.20
Boxcar Window           182                         1.13    10.1    8.52 × 10^5
Triangle Window         176                         1.22    16.3    7.82 × 10^5
Half-Gaussian Window    175                         1.23    9.95    7.53 × 10^5
Transfer Function       178                         1.18    14      7.39 × 10^5

where $\beta_j = \beta \omega_j w_j$. This is a nonlinear transfer function
model with $n + 3$ fitting parameters $P_0, \alpha, \beta_0, \beta_1, \ldots, \beta_n$.
The unit integral restriction on the memory function implies
that $\beta = \sum_{j=0}^{n} \beta_j$, so, having calculated that value,
the discrete memory function estimates can be obtained
from the transfer coefficients by

$$w_j = \frac{\beta_j}{\beta \omega_j}, \qquad j = 0, 1, 2, \ldots, n.$$
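The closed-form solution and the recovery of the memory function can be sketched as follows. The function names are hypothetical, `temperature` stands for any callable returning $T$ at a given year, and the quadrature coefficients are supplied by the caller; the numbers in the usage note are made up for illustration and are not fitted values.

```python
import math

def production(t, p0, alpha, betas, temperature):
    """P(t) = P0 * exp(alpha*t - sum_j beta_j * [T(t - j) - T(-j)])."""
    modulation = sum(b * (temperature(t - j) - temperature(-j))
                     for j, b in enumerate(betas))
    return p0 * math.exp(alpha * t - modulation)

def memory_estimates(betas, quad_weights):
    """Return (beta, [w_j]) with beta = sum_j beta_j and w_j = beta_j / (beta * omega_j)."""
    beta = sum(betas)
    return beta, [b / (beta * q) for b, q in zip(betas, quad_weights)]
```

For example, with illustrative transfer coefficients `[0.3, 0.5, 0.2, 0.18]` and trapezoid weights `[0.5, 1.0, 1.0, 0.5]`, `memory_estimates` returns a total $\beta$ of 1.18 and weights $w_j$ whose quadrature sum $\sum_j \omega_j w_j$ equals one, as the unit integral restriction requires.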

Our strategy for determining $\tau$ was to make $n$ as large
as possible with $\beta_j > 0$, $j = 1, 2, \ldots, n$. The result was
$n = 13$ ($\tau = 14$). The estimated memory function is
plotted as connected circles in Fig. 3, and the other
parameters are given in the last row of the table. The
estimated $P(t)$, shown as a solid line in Figs. 2 and 6,
tracks the measured data remarkably well.

Critics have claimed that 16 adjustable parameters
should give a good fit using any time series for $T(t)$.
Therefore, we generated 300 artificial $T(t)$ records, using
the least squares fit in Fig. 1 as a baseline and adding
normally distributed random deviates with mean 0 and
variance equal to that of the real temperatures about
that baseline. Repeating the fit for each of those records,
we obtained, in every case, one or more $\beta_j < 0$. We also
obtained the SSR distribution shown in Fig. 4, where
vertical lines mark the mean, $-1\sigma$, $-2\sigma$ and $-3\sigma$ points,
and the black square marks the SSR for the real temperatures.
Clearly, the probability of obtaining, by chance,
such a low value of SSR, with all $\beta_j > 0$, is negligible. In
fact, using measured data which averaged both the land
and marine temperatures gave 4 negative $\beta_j$ values and
doubled the SSR. A good fit is obtained only with measured
temperatures for the Northern Hemisphere land
surface, where most fossil fuel is consumed.
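The surrogate-data check described above can be sketched as follows. Only the generation of artificial records and the tail comparison are shown; refitting the model to each record is left to the caller. The function names, the seed handling, and the baseline intercept in the usage note are assumptions made for this example.

```python
import random

def surrogate_records(years, intercept, slope, resid_sd, n_records=300, seed=0):
    """Artificial temperature records: straight-line baseline plus N(0, resid_sd^2) deviates."""
    rng = random.Random(seed)
    return [[intercept + slope * y + rng.gauss(0.0, resid_sd) for y in years]
            for _ in range(n_records)]

def lower_tail_fraction(observed, null_values):
    """Fraction of null-distribution values that are at or below the observed value."""
    return sum(1 for v in null_values if v <= observed) / len(null_values)
```

A slope of 0.38 °C per century corresponds to `slope=0.0038` per year. After refitting the model to each surrogate record, `lower_tail_fraction(ssr_real, ssr_surrogates)` gives an empirical estimate of how often chance alone produces an SSR as low as the one obtained with the real temperatures.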

Lovelock (1979) propounded the Gaia hypothesis,
which postulates that life regulates and maintains the
conditions needed to assure its survival. He noted that
Earth's surface temperature has been nearly constant
for the $3 \times 10^9$ year history of life, even though the
Sun's luminosity has increased tenfold in that time.
Surface temperature depends critically on the concentration
of greenhouse gases in the atmosphere. According to
the Gaia hypothesis, a warming caused by fossil fuel
consumption should produce a feedback curtailing that
consumption. The inverse modulation identified in the
present study may represent just such a feedback.

Fossil  fuel  production  is  an  indicator  of  economic 
vigor.  It  is  not  yet  possible  to  predict  future  temper¬ 
ature  variations,  so  the  model  described  here  can  only 
make  provisional  predictions  of  future  production  by  as¬ 
suming  various  temperature  scenarios.  Three  such  sce¬ 
narios  are  shown  in  Fig.  5,  where  circles  represent  mea¬ 
surements  (1971-1988),  and  triangles,  squares,  and  di¬ 
amonds  represent  20  years  of  increasing,  stable,  or  de¬ 
creasing  temperatures,  respectively.  The  corresponding 
provisional  predictions  are  shown  in  Fig.  6  where  circles 
represent  measurements  (1971-1986),  the  solid  line  rep¬ 
resents  model  predictions  for  those  years,  and  the  trian¬ 
gles,  squares,  and  diamonds  are  model  predictions  for  the 
3  future  scenarios.  The  initial  prediction  in  all  cases  is 
for  4  or  5  years  of  declining  or  static  production.  There¬ 
after,  the  cooling  scenario  predicts  spectacular  recovery, 
the  stable  temperatures  predict  4  or  5  additional  years 
of  static  production  followed  by  recovery,  and  the  con¬ 
tinued  warming  scenario  predicts  4  or  5  years  of  further 
declines followed by almost static production. The
production  totals  since  1986  are  not  yet  available,  but 
the  current  economic  situation  does  not  contradict  the 
initial  predictions. 
Abstract

We  address  the  problem  of  influence  in  estimating  the 
smoothing  parameter  when  fitting  a  univariate  smooth¬ 
ing  spline.  Using  the  local-influence  methods  of  Cook 
(1986),  a  diagnostic  is  derived  to  identify  observed  re¬ 
sponses  that  locally  influence  the  choice  of  smoothing 
parameter  by  generalized  cross-validation.  The  diagnos¬ 
tic  motivates  a  discussion  of  an  apparent  sensitivity  of 
generalized  cross-validation. 


1  Introduction 

Consider scalar responses $y_j$ generated according to the
model $y_j = \mu(t_j) + \epsilon_j$, where $\mu$ is a "smooth" regression
function, $a < t_1 < \cdots < t_n < b$, and the errors $\epsilon_j$
are uncorrelated, with zero mean and constant variance.
We assume that $\mu$ is smooth enough to belong to the
set $W_2^m[a, b]$ of functions $g$ that, for some fixed $m$, have
$m - 1$ continuous derivatives and square-integrable $m$th
derivative in $[a, b]$. The smoothing spline estimator $\hat\mu_\lambda$
of $\mu$ satisfies a penalized least squares criterion: it is the
minimizer over $g \in W_2^m[a, b]$ of

$$n^{-1} \sum_{j=1}^{n} \{y_j - g(t_j)\}^2 + \lambda \int_a^b \{g^{(m)}(t)\}^2\,dt, \qquad \lambda > 0. \qquad (1.1)$$

Here, the integral is a penalty for roughness in the spline.
For a fixed value of $\lambda > 0$, the smoothing spline is a linear
smoother, i.e., there is a "hat" matrix $H_\lambda$, depending
only on the design points $\{t_j\}$ and $\lambda$, that transforms the
data vector $y$ into the vector of smoothing spline fitted
values: $H_\lambda y = \hat\mu_\lambda$. Discussions of smoothing splines in
statistics may be found in Wegman and Wright (1983),
Silverman (1985), Eubank (1988), and Wahba (1990).
One chooses the smoothing parameter $\lambda$ in (1.1) to
balance the competing aims of smoothness and close fit
to the data. Small values of $\lambda$ produce rougher curves
that follow the data more closely, while large values of $\lambda$
give smoother curves. A popular data-driven method for
selecting $\lambda$ is generalized cross-validation (GCV),
introduced by Craven and Wahba (1979). The GCV choice $\hat\lambda$
minimizes over $\lambda > 0$

$$G(\lambda) = \frac{n^{-1}\,e_\lambda^T e_\lambda}{\{n^{-1}\,\mathrm{tr}(I - H_\lambda)\}^2}, \qquad (1.2)$$

where $I$ is the $n \times n$ identity matrix and $e_\lambda = (I - H_\lambda)y$
is the vector of residuals. For discussions of GCV in
related smoothing problems, see Li (1985), Hall and
Titterington (1987), and Härdle, Hall, and Marron (1988).

Analogues of familiar linear-regression diagnostics,
based on case deletion, have been proposed for smoothing
splines; see Wendelberger (1981), Eubank (1984,
1985), Silverman (1985), Eubank and Gunst (1986).
However, no diagnostics for GCV have appeared in the
literature. Although case-deletion diagnostics for the
GCV choice $\hat\lambda$ are an obvious approach, they are
computationally infeasible for large datasets. Further, as
will be discussed in Section 3, the estimate $\hat\lambda$ is
apparently sensitive to groups of observations acting together
rather than single outlying points. Hence case-deletion
diagnostics may not be very relevant.

2  A diagnostic for influential responses

To develop a diagnostic when influential groups of cases
are a possibility, a natural approach is to perturb all
observations simultaneously, rather than modifying or
deleting single cases. To do this, we add a vector $\omega$ of
small perturbations to produce $y_\omega = y + \omega$. Through
their local influence (Cook, 1986), we can identify groups
of responses that play a large role in the determination
of the GCV estimator $\hat\lambda$.

For each $\omega$, GCV applied to the perturbed data $y_\omega$
selects $\hat\lambda(\omega)$. This defines a map $\omega \mapsto \hat\lambda(\omega)$ as $\omega$ ranges
in an open set about $\omega = 0$, where $\hat\lambda(0) = \hat\lambda$ from the
unperturbed data. We approximate the surface $\hat\lambda(\omega)$ with
its tangent plane at $\omega = 0$ and find the direction of
maximum slope $\ell_{\max}$ on this plane. It can be shown that
$\ell_{\max} \propto \partial\hat\lambda(\omega)/\partial\omega^T$, evaluated at $\omega = 0$. For the
perturbation defined above, it is straightforward to calculate

$$\ell_{\max} \propto (cI - H)(I - H)^2 y, \qquad (2.1)$$

where $c = \mathrm{tr}\{H(I - H)\}/\mathrm{tr}(I - H)$ and $H = H_{\hat\lambda}$.

The essential idea is that a direction of large local
change in the $\hat\lambda(\omega)$ surface at $\hat\lambda(0) = \hat\lambda$ corresponds
to perturbation of influential responses, the vector
$\ell_{\max}$ approximates this direction, and therefore large
components of $\ell_{\max}$ flag locally influential observations.
The diagnostic is a plot of $\ell_{\max}$ against case number,
where cases with relatively large absolute components
are jointly influential. The sign of a component indicates
the direction in which to alter the response to produce
a large (local) change in $\hat\lambda$.

3  Sensitivity  of  GCV 

To illustrate the diagnostic, we consider the data
shown in Figure 1, generated by adding independent
Uniform$[-3, 3]$ errors to the sinusoidal mean function
indicated by the dotted curve. The solid curve is the
periodic cubic smoothing spline (defined below) fitted to
the simulated data, using GCV to select
$\hat\lambda = 7.6 \times 10^{-5} \approx 5\{(n - 1)/2\}^{-4}$. An index plot of $\ell_{\max}$
(not shown) identifies five jointly influential cases, marked
with filled circles: moving the responses for cases 17, 19, and 31
in one direction, and cases 4 and 18 in the opposite direction,
will produce a large local change in $\hat\lambda$. Note that the
locally influential responses are not "outlying" points.

Some experience with the diagnostic suggests that
when $\hat\lambda$ is large, it is not particularly sensitive to small
subsets of observations. However, when $\hat\lambda$ is very small,
it seems to be sensitive to groups of observations which
make it appear that the regression function has important
high-frequency components. This can be made precise
by examining what happens to the high-frequency
components of $y$ in GCV and the diagnostic. For simplicity,
we consider the special case of periodic smoothing
splines (Eubank, 1988, sec. 6.3.1) where the mapping
from the "time domain" ($y$) to the frequency domain is
particularly transparent. However, the ideas extend in
principle to the general case.


For periodic cubic splines ($m = 2$), we assume
the model (1.1) and in addition that: (i) the $t_j$ are
equally spaced in $[0, 1]$, (ii) $\mu$ is smoothly periodic in the
sense that $\mu(0) = \mu(1)$ and $\mu^{(1)}(0) = \mu^{(1)}(1)$, and, for
simplicity, (iii) $n$ is odd. Write the Fourier transform of
$y$ as $f(y) = Xy/n$, where $X$ is the $n \times n$ matrix with
rows

$$x_r^T = (1, \exp(2\pi i r/n), \ldots, \exp\{2\pi i (n - 1) r/n\}),$$

in the order $r = -(n - 1)/2, \ldots, (n - 1)/2$, and where
$i^2 = -1$. The Fourier coefficient of $y$ for the frequency
$r/n$ is the $r$th component of $f(y)$,

$$f_r(y) = n^{-1} \sum_{k=1}^{n} y_k \exp\{2\pi i (k - 1) r/n\},$$

so that the $f_r$ for large $|r|$ correspond to high frequencies.
Then $|f_r(y)|^2$ is the power of the signal $y$ at frequency
$r/n$.

The cubic periodic spline estimate of $\mu$ for fixed
$\lambda > 0$ is, to a high order of approximation,
$\hat\mu_\lambda = X^* W X y/n = X^* W f(y)$, where $X^*$ is the hermitian, or
conjugate transpose, of $X$, and $W = W(\lambda)$ is a diagonal
matrix with diagonal elements $w_r(\lambda) = (1 + \lambda r^4)^{-1}$, for
$r = -(n - 1)/2, \ldots, (n - 1)/2$. Since the weights $w_r(\lambda)$
decrease with increasing frequency $|r/n|$, $W$ acts as a
low-pass filter in the frequency domain which smooths
the data by damping high-frequency components of $y$.
The amount of damping depends on $\lambda$: small values
produce less damping, large values more.
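The low-pass filtering view can be sketched directly from these definitions. The code below is a plain-Python illustration, not the authors' implementation; it assumes $n$ odd and takes the weights to be $w_r(\lambda) = (1 + \lambda r^4)^{-1}$. All function names are hypothetical.

```python
import cmath
import math

def frequencies(n):
    """Frequency indices r = -(n-1)/2, ..., (n-1)/2 for odd n."""
    half = (n - 1) // 2
    return list(range(-half, half + 1))

def fourier_coefficients(y):
    """f_r(y) = (1/n) * sum_k y_k * exp{2*pi*i*(k-1)*r/n}, with k counted from 1."""
    n = len(y)
    return [sum(yk * cmath.exp(2j * math.pi * k * r / n) for k, yk in enumerate(y)) / n
            for r in frequencies(n)]

def spline_weights(n, lam):
    """Low-pass weights w_r(lambda) = 1 / (1 + lambda * r^4)."""
    return [1.0 / (1.0 + lam * r ** 4) for r in frequencies(n)]

def periodic_spline_fit(y, lam):
    """Fitted values X* W f(y): damp each Fourier coefficient, then transform back."""
    n = len(y)
    f = fourier_coefficients(y)
    w = spline_weights(n, lam)
    fitted = []
    for k in range(n):
        value = sum(wr * fr * cmath.exp(-2j * math.pi * k * r / n)
                    for r, wr, fr in zip(frequencies(n), w, f))
        fitted.append(value.real)  # imaginary part cancels because w_r = w_{-r}
    return fitted
```

With $\lambda = 0$ no damping occurs and the fit reproduces the data; as $\lambda$ grows, every component except $r = 0$ is suppressed and the fit approaches the sample mean.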

To examine influence on GCV, we rewrite $\ell_{\max}$ in
(2.1) as a function of the Fourier coefficients of the data
$f(y)$:

$$\ell_{\max} \propto X^* (cI - W)(I - W)^2 f(y),$$

where $c$ is defined below (2.1). The filter $(cI - W)(I - W)^2$
is increasing in $|r/n|$ for all values of $\lambda$ and so acts
as a high-pass filter, increasing the output power at high
frequencies. Thus, $\ell_{\max}$ has large absolute components
corresponding to groups of responses which make large
contributions to the high-frequency components of the
data $y$.
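A sketch of the diagnostic in its Fourier form is given below, under the same assumptions as the periodic spline setting ($n$ odd, weights $w_r(\lambda) = (1 + \lambda r^4)^{-1}$, $\lambda > 0$). The function name and the scaling of the result to unit length are choices made for this example.

```python
import cmath
import math

def local_influence_direction(y, lam):
    """Unit-length vector proportional to X*(cI - W)(I - W)^2 f(y); odd n, lam > 0."""
    n = len(y)
    half = (n - 1) // 2
    freqs = range(-half, half + 1)
    w = [1.0 / (1.0 + lam * r ** 4) for r in freqs]
    # c = tr{H(I - H)} / tr(I - H), written in terms of the diagonal of W.
    c = sum(wr * (1.0 - wr) for wr in w) / sum(1.0 - wr for wr in w)
    f = [sum(yk * cmath.exp(2j * math.pi * k * r / n) for k, yk in enumerate(y)) / n
         for r in freqs]
    gain = [(c - wr) * (1.0 - wr) ** 2 for wr in w]  # high-pass filter
    direction = [
        sum(g * fr * cmath.exp(-2j * math.pi * k * r / n)
            for r, g, fr in zip(freqs, gain, f)).real
        for k in range(n)
    ]
    norm = math.sqrt(sum(d * d for d in direction))
    return [d / norm for d in direction] if norm > 0.0 else direction
```

Plotting the returned components against case number gives the index plot described in Section 2; cases with relatively large absolute components are the candidates for joint influence.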

Finally, the GCV criterion (1.2) can be expressed
in terms of the power $|f_r(y)|^2$ at various frequencies as

$$G(\lambda) = \sum_{|r| \le (n-1)/2} \theta_r(\lambda)\,|f_r(y)|^2,$$

where 
Note that, in contrast to the weights $w_r(\lambda)$ for the
periodic spline, the GCV weights $\theta_r(\lambda)$ are strictly increasing
with $|r/n|$, so that high-frequency components of
the data may have a larger role in determining $\hat\lambda$. The
amount by which high frequencies outweigh low frequencies
depends critically on the value of $\lambda$. Figure 2 shows
several sequences of GCV weights $\theta_r(\lambda)$ with $n = 51$, for
$\lambda$ equal to $\lambda_0 = \{(n - 1)/2\}^{-4}$, $10\lambda_0$, and two further
power-of-ten multiples of $\lambda_0$.

When $\lambda$ is near $\{(n - 1)/2\}^{-4}$, high frequencies receive
substantially greater weight. Thus, when GCV is minimized
at a very small $\lambda$, it may be driven by small groups
of cases which contribute to the power of $y$ at high
frequencies. When GCV is minimized at a large $\lambda$, it is
relatively insensitive, since higher and lower frequencies
have nearly equal weight.
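The behaviour of the GCV weights can be explored numerically. The sketch below assumes that the weights take the form $\theta_r(\lambda) = \{1 - w_r(\lambda)\}^2 / \{n^{-1}\sum_s (1 - w_s(\lambda))\}^2$, which is what results from substituting $H = X^* W X/n$ into a GCV criterion of the usual form. The normalizing constant is an assumption; only the relative sizes of the weights matter for the comparison.

```python
def gcv_weights(n, lam):
    """Assumed GCV weights theta_r for r = -(n-1)/2, ..., (n-1)/2 (n odd)."""
    half = (n - 1) // 2
    one_minus_w = [1.0 - 1.0 / (1.0 + lam * r ** 4) for r in range(-half, half + 1)]
    scale = (sum(one_minus_w) / n) ** 2
    return [v * v / scale for v in one_minus_w]

def high_to_low_ratio(n, lam):
    """Weight at the highest frequency divided by the weight at |r| = 1."""
    theta = gcv_weights(n, lam)
    half = (n - 1) // 2
    return theta[-1] / theta[half + 1]
```

With $n = 51$ and $\lambda = \lambda_0 = \{(n - 1)/2\}^{-4}$, the highest frequency outweighs $|r| = 1$ by many orders of magnitude; increasing $\lambda$ to a large multiple of $\lambda_0$ shrinks that ratio sharply.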
Figure 1. Simulated data based on a periodic regression function (dotted curve) with a cubic spline fit
(solid curve). Data are plotted against case number rather than $t$.

Figure 2. GCV coefficients $\theta_r(\lambda)$ vs. $|r|$, for several values of $\lambda$, given as multiples of $\lambda_0 = \{(n - 1)/2\}^{-4}$.
1  Introduction

Let $(X, Y)$ be a bivariate random vector with
$E|Y| < \infty$. The nonparametric regression problem is
to estimate the regression function

$$r(x) = E(Y \mid X = x) \qquad (1)$$

based on a random sample $(X_i, Y_i)$, $i = 1, \ldots, n$, from
$(X, Y)$.

The Nadaraya-Watson (NW), the Nearest Neighbor
(NN), and the Optimal Quantile (OQ) kernel type
estimators of $r(x)$ defined in (2)-(4) depend on smoothing
parameters $h$, $k$ and $p$, respectively. The asymptotically
optimal form of these smoothing parameters is
known; see Collomb (1977) and Mack (1981). This
information, however, is not sufficient in practical
applications, and data driven (DD) methods for choosing
smoothing parameters have been developed; see Hall
(1984), Rice (1984), Härdle and Marron (1985), Marron
and Härdle (1986), Bhattacharya and Mack (1987),
Härdle, Hall and Marron (1988) and Kozek and Schuster
(1990). One popular DD method of choosing smoothing
parameters, the so-called leave-out-one-at-a-time
cross-validation principle (CVP), chooses the smoothing
parameter, say $v$, to minimize $CV(v)$ of equation (6). The
CVP measures fit of the estimator to the data on the
set $\{X_1, \ldots, X_n\}$. This criterion imposes no condition
on the behavior of the curve between the points $X_i$. It is
not surprising then that we frequently observe excellent
fit, but simultaneously nonregular behavior elsewhere.
It is well known (and has been our experience) that the
CVP tends to choose an estimator which overfits the
data. This lack of smoothness is often visible in small
or moderate sample sizes, say 10-50. In fact, for sample
sizes this small, the CV function is often degenerate
in the sense that it is not defined for small values of
the smoothing parameter and it increases on the interval
where it is well defined (see Figure 1).

What we desire in practice is a uniform behavior of the
regression estimator and its derivatives. In this context,
many interesting ideas have appeared in spline estimation
for the case of nonrandom $X$ and special experimental
designs; see Wahba (1990) and Eubank (1990). The
key to the success here seems to be in the formulation of
the minimization criterion:

•  minimize a simple expression which penalizes both
for the lack of fit and for the lack of smoothness.

In Section 2 we propose our criterion for the adaptive
choice of the smoothing parameter for kernel type
estimators in the case of random $X$. As in spline theory, we
penalize both for lack of fit and for lack of smoothness.

2  Fit  the  Short  Curve  Principle 

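We restrict our consideration to the following three
kernel type estimators of the regression function

$$\hat r_h(x) = \frac{\sum_{i=1}^{n} Y_i\,K\big((x - X_i)/h\big)}{\sum_{i=1}^{n} K\big((x - X_i)/h\big)}, \qquad (2)$$

$$\hat r_k(x) = \frac{\sum_{i=1}^{n} Y_i\,K\big((x - X_i)/d_k(x)\big)}{\sum_{i=1}^{n} K\big((x - X_i)/d_k(x)\big)}, \qquad (3)$$

$$\hat r_p(x) = \frac{\sum_{i=1}^{n} Y_i\,K\big((x - X_i)/q_p(x)\big)}{\sum_{i=1}^{n} K\big((x - X_i)/q_p(x)\big)}, \qquad (4)$$

where $K$ is a nonnegative kernel and $h$, $d_k(x)$, and $q_p(x)$
are window bandwidths corresponding to the Nadaraya-Watson
(Nadaraya (1964), Watson (1964)), $k$-th nearest
neighbor (Cover (1968), Collomb (1980)) and $p$-th optimal
quantile (Kozek and Schuster (1990)) estimators,
respectively. Here $d_k(x)$ is the distance from $x$ to its $k$-th
nearest neighbor in the sample $X_1, \ldots, X_n$ and $q_p(x)$ is
the $p$-th quantile corresponding to $\tilde Q_n(\cdot)$, a continuous
linearly smoothed version of the empirical distribution
function $Q_n(\cdot)$ based on $|x - X_1|, \ldots, |x - X_n|$. $\tilde Q_n(\cdot)$
is given by

$$\tilde Q_n(t) = \begin{cases} 0 & \text{if } t < d_1(x), \\ \left[k - 1 + \dfrac{t - d_k(x)}{d_{k+1}(x) - d_k(x)}\right] \Big/ (n - 1) & \text{if } t \in I_k(x), \\ 1 & \text{if } t \ge d_n(x), \end{cases}$$

where $d_1(x) \le \cdots \le d_n(x)$ are the ordered quantities
$|x - X_1|, \ldots, |x - X_n|$, and $I_k(x) = [d_k(x), d_{k+1}(x))$.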
Whenever any of the estimators (2)-(4) is not well
defined, i.e. its denominator equals zero, we assign it a large
constant value (a large power of 10). Such a convention is
useful from a numerical point of view and has been applied
in the Fit Short (FS)¹ package. The simulations via FS
have been made for a variety of long and short tailed
kernels including the Gaussian kernel and the continuously
differentiable (compact support) quartic kernel

$$K(x) = \begin{cases} \frac{15}{16}(1 - x^2)^2 & \text{for } x \in [-1, 1], \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

The estimators $\hat r_h(x)$, $\hat r_k(x)$ and $\hat r_p(x)$ depend on the
smoothing (window width) parameters $h$, $k$, and $p$. Let
$\hat r_v(x)$ stand for the estimator corresponding to a
parameter $v \in \{h, k, p\}$.
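A minimal sketch of ratio-type kernel estimation with a fixed and a nearest-neighbor window is given below for illustration. It is not the FS package: the function names are hypothetical, the quartic kernel is taken in its usual normalized form $\frac{15}{16}(1 - u^2)^2$, and `LARGE` is a stand-in for the large constant returned when the denominator vanishes.

```python
LARGE = 1e20  # stand-in for the "large constant"; the value used by FS is not assumed

def quartic_kernel(u):
    """Quartic (biweight) kernel with support [-1, 1]."""
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if -1.0 <= u <= 1.0 else 0.0

def nadaraya_watson(x, xs, ys, h, kernel=quartic_kernel):
    """Ratio-type kernel estimate of r(x) with window width h."""
    weights = [kernel((x - xi) / h) for xi in xs]
    denom = sum(weights)
    if denom == 0.0:
        return LARGE  # estimator not well defined at x
    return sum(w * yi for w, yi in zip(weights, ys)) / denom

def knn_bandwidth(x, xs, k):
    """d_k(x): distance from x to its k-th nearest sample point."""
    return sorted(abs(x - xi) for xi in xs)[k - 1]
```

The nearest-neighbor estimate at `x` is then `nadaraya_watson(x, xs, ys, knn_bandwidth(x, xs, k))`, provided $d_k(x) > 0$. With the quartic kernel the $k$-th neighbor itself receives zero weight, since it lies on the boundary of the window.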

To penalize for lack of fit we use the $CV(v)$ function

$$CV(v) = \frac{1}{n} \sum_{i=1}^{n} \{Y_i - \hat r_v^{(-i)}(X_i)\}^2, \qquad (6)$$

where $\hat r_v^{(-i)}(x)$ is given by one of (2)-(4), but is based on
all data pairs except $(X_i, Y_i)$. Let $L(v)$
be a function to penalize for roughness. Since overfitting
the data would tend to produce estimators $\hat r_v(x)$
whose derivatives were large in magnitude midway
between adjacent observations on the independent variable,
we looked for natural functionals $L(v)$ which penalize
for large values of $|\hat r_v'(x)|$ at the midpoints of the order
statistics, i.e. at points $(X_{(i-1)} + X_{(i)})/2$.

Penalty functionals $L(v)$ we considered were the
length, the total variation, and the curvature of the
estimator $\hat r_v(x)$. A natural criterion to impose on a
penalizing procedure is that it produces a smoothing
parameter which is invariant under linear transformations on
the dependent variable $Y$. The total variation (the
integral of the absolute value of the first derivative) and the
global measure of curvature (the integral of the square
of the second derivative) possess this invariance
property. The length (the integral of the square root of 1
plus the square of the first derivative) does not. The
curvature functional possesses desirable theoretical and
computational properties as a penalty function in cubic
spline estimation which do not carry over to the present
problem. When the derivative is large, the length and
variation functionals produce essentially the same values.
Moreover, the length criterion is more easily understood
by practitioners and seemed to work somewhat better
than the variation functional in our experimentation.
For these reasons, we have chosen to present our
penalizing criterion for the length functional. Our general
approach applies to any of the DD criteria discussed
in Härdle et al. (1988) and any penalizing functional.
For the sake of simplicity, however, we restrict our
attention to the CVP and we penalize for estimators with
excessive length. We use the following Riemann sum
approximation to the length of $\hat r_v(x)$ (and $r(x)$) on an
interval $[A, B]$:

$$L(v) = \sum_i \Delta X_i \sqrt{1 + \big(\hat r_v'(X_i^*)\big)^2}, \qquad (7)$$

where $\Delta X_i = X_{(i)} - X_{(i-1)}$, $X_i^* = (X_{(i-1)} + X_{(i)})/2$,
$\hat r_v'(x)$ is an approximation to the derivative of the
estimator $\hat r_v(x)$ at $x$, and the sum is over order statistics
$X_{(i)}$ in an interval $[A, B]$. Most of our simulations were
run with $A$ and $B$ corresponding to symmetric trimming
of 0-10% of the data pairs corresponding to the smallest
and largest $X_{(i)}$'s. The FS package simplifies computations
of the derivatives of the NN and OQ estimators
by treating $d_k(x)$ and $q_p(x)$ as constants.

¹ FS was developed in 1991 at U.T. El Paso with the
assistance of Krzysztof Kozek.

Our Fit the Short Curve Principle (FSCP) can
now be described as follows:

•  find the smoothing parameter $v_0$ minimizing the
cross validation term $CV(v)$ in (6),

•  choose the smoothing parameter $v$ which minimizes

$$FSC(v) = \frac{CV(v)}{CV(v_0)} + \frac{L(v)}{L(v_0)}. \qquad (8)$$

We prefer (8) to any other gauge function of $CV(v)$ and $L(v)$
we have tried. $CV(v_0)$ will tend to underestimate the
mean square error of the regression estimator and $L(v_0)$
will tend to overestimate length. Thus the FSC function
of (8) will tend to weigh the fit criterion more heavily and
should be near 2 when properly gauged. We did iterate
this procedure. However, there was little improvement
in the estimators produced in the second iteration.
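The two steps above can be sketched for the NW estimator as follows. This is an illustration under stated assumptions, not the FS implementation: a Gaussian kernel is used, the cross-validation score is averaged over the sample, the derivative in the length term is a central finite difference, no trimming is applied, and all names are hypothetical.

```python
import math

LARGE = 1e20  # stand-in value returned when the estimator is not well defined

def nw(x, xs, ys, h):
    """Nadaraya-Watson estimate with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    denom = sum(weights)
    if denom == 0.0:
        return LARGE
    return sum(w * yi for w, yi in zip(weights, ys)) / denom

def cross_validation(xs, ys, h):
    """Leave-one-out CV score: mean squared prediction error at the sample points."""
    total = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        total += (ys[i] - nw(xs[i], rest_x, rest_y, h)) ** 2
    return total / len(xs)

def curve_length(xs, ys, h, eps=1e-4):
    """Riemann-sum length with slopes taken at midpoints of adjacent order statistics."""
    order = sorted(xs)
    length = 0.0
    for left, right in zip(order, order[1:]):
        mid = 0.5 * (left + right)
        slope = (nw(mid + eps, xs, ys, h) - nw(mid - eps, xs, ys, h)) / (2 * eps)
        length += (right - left) * math.sqrt(1.0 + slope * slope)
    return length

def fit_short_curve(xs, ys, grid):
    """Return (v0, v_fsc, fsc): the CV minimizer, the FSC minimizer, and FSC on the grid."""
    cv = {h: cross_validation(xs, ys, h) for h in grid}
    h0 = min(grid, key=cv.get)
    cv0, len0 = cv[h0], curve_length(xs, ys, h0)
    fsc = {h: cv[h] / cv0 + curve_length(xs, ys, h) / len0 for h in grid}
    return h0, min(grid, key=fsc.get), fsc
```

By construction the FSC value at $v_0$ equals 2, so the selected bandwidth always attains a value of at most 2; it differs from $v_0$ only when a shorter curve can be bought for a modest loss of fit.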

In the next section we shall see that one can use any
DD criterion including FSCP to select the bandwidth
parameter in some specified envelope and retain the strong
consistency of $\hat r_v(x)$. Computer simulations using FSCP,
examples, and conclusions are discussed in Section 4.


3  The  strong  consistency 

In this section we show that if some asymptotic
restrictions on the bandwidth parameter sequence are imposed
and some mild regularity conditions are satisfied, then
any data driven choice of the bandwidth leads to a
pointwise strongly consistent sequence of estimators. Hence
we can conclude that estimators (2)-(4), with window
width parameter selected from an envelope by the FSCP,
converge pointwise with probability 1 to the true
regression function. In this direction let

$$C(n) \searrow 1 \quad \text{as } n \to \infty, \qquad (9)$$
$$h(n) = D n^{-1/5}, \qquad (10)$$
$$h_1(n) = h(n)/C(n), \qquad (11)$$
$$h_2(n) = C(n)\,h(n), \qquad (12)$$
$$c_1 H(|x|) \le K(x) \le c_2 H(|x|), \quad 0 < c_1 < c_2, \qquad (13)$$
$$K(x) \ge c > 0 \quad \text{if } |x| < r, \text{ some } c, r > 0, \qquad (14)$$
$$K(\lambda x) \text{ is nonincreasing in } \lambda, \quad \lambda > 0, \qquad (15)$$

where $K$ is a Borel kernel and $H$ is a nonnegative,
decreasing, bounded, Lebesgue integrable function on
$x > 0$. From Theorem 1 in Kozek and Schuster (1991)
we infer the convergence of general DD kernel-type
estimators.

Theorem 3.1 Let $(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n)$ be
independent, identically distributed bivariate random
variables such that the probability distribution $\mu$ of $X$ has no
singular component and

E|Y|(log(H- 1^1))*+*  <  oo /or  some  5  >  0.(16) 

Let the kernel $K$ and the bandwidths $h_1$ and $h_2$
satisfy (9)-(15). If $h_1 \le h \le h_2$ and
$h = h(n, x, X_1, Y_1, \ldots, X_n, Y_n)$, then the NW estimator (2)
with the bandwidth $h$ is, for $\mu$-a.e. $x$, convergent with
probability one to $r(x)$.

Corollary 3.2 If the minimization of FSC in (8) is
over $v = h \in [Dn^{-1/5}/C(n), Dn^{-1/5}C(n)]$, then the NW
estimator (2) with the bandwidth $h$ chosen by the FSCP
is, for $\mu$-a.e. $x$, convergent with probability one to $r(x)$.

From Theorem 3.1 it follows that the strong
consistency does not impose any severe restrictions upon the
choice of the sequence $C(n)$ in (9). It allows adjustments
suitable to local predilections.

From Theorems 2 and 3 of Kozek and Schuster (1991)
it follows that if $X$ has an almost everywhere continuous
Lebesgue density, then:

Corollary 3.3 If the minimization of FSC in (8) is
over $v = k \in [Dn^{4/5}/C(n), Dn^{4/5}C(n)]$, then the NN
estimator (3) with the bandwidth parameter $k$ chosen by
the FSCP is, for $\mu$-a.e. $x$, convergent with probability one
to $r(x)$.

Corollary 3.4 If the minimization of FSC in (8) is
over $v = p \in [Dn^{-1/5}/C(n), Dn^{-1/5}C(n)]$, then the OQ
estimator (4) with the bandwidth parameter $p$ chosen by
the FSCP is, for $\mu$-a.e. $x$, convergent with probability one
to $r(x)$.

4  Conclusions  and  Examples 

We summarize our simulation experience with the
estimators (2)-(4) using FSCP as implemented in the
statistical package FS running on IBM XT, AT, PS/2 or
on IBM compatible personal computers. Frequently, for
samples of sizes 10-100, the CV function is strictly
increasing on the interval where it is well defined. Since the
length functional decreases rapidly as window width
increases, one can heuristically argue that the FSCP has
the desired effect we observed in all simulations, i.e.
FSCP chooses a larger window than that given by the
CVP alone. As a result the estimators obtained by the
FSCP are smoother than those obtained by the CVP.
Typically, the FSCP, in contrast to the CVP, has an
objective function $FSC(v)$ with a well determined
minimum (see Figure 1) occurring at a point which seems to
reasonably balance fit with smoothness.


Poor results were obtained for kernels which can
assume negative values. Kernels of this type possess
optimal properties in the fixed design case of nonrandom
$X$ which do not seem to be present for the ratio type
kernel estimators (2)-(4) in the bivariate case of
random $X$. The quartic kernel in (5) is a computationally
simple, symmetric, unimodal, and continuously
differentiable type kernel desired in the FSCP. Overall it worked
well for NW estimators with little difference between the
CVP and the FSCP criterion in cases where there were
no large spacings among the $X_i$'s. NW estimators
using the CVP with Gaussian or Student's t kernels were
occasionally quite rough. The FSCP based estimator
frequently showed substantial improvement in these cases.

Our experience leads us to believe that there are
inherent limitations with kernel estimators of the NW type of
(2) which utilize a constant window width. Estimators
NN and OQ of (3)-(4) allow for varying window width
and adapt to the local density of the $X$ variable. The
NN estimator is quite rough, particularly for small
samples, and is computationally awkward and time consuming
to analyze using CVP or FSCP type criteria. The
OQ estimator is a smoothed version of the NN estimator
studied by Kozek and Schuster (1990) which moderates
these difficulties and seemed to perform reasonably well
over a variety of regression models. It seems much less
sensitive to both the choice of kernel and the presence
of large spacings in the $X_i$'s. In cases where the CVP
worked well there was often no significant difference
between the CVP and the FSCP OQ estimates.

When properly normalized, the denominators of the
regression estimators (2)-(4) estimate the density of X
in the absolutely continuous case. The FS package includes
an option to compare these density estimates with
the  true  density  in  simulations.  Our  experience  indicates 
strong  links  between  values  of  smoothing  parameters  cor¬ 
responding  to  good  regression  estimators  and  good  esti¬ 
mates  of  the  density  of  the  random  variable  X. 

To illustrate points raised in our discussions we have
used the FS package to take a random sample of 20
pairs from the regression model $Y = X^2 + \epsilon$, where X
is standard normal and $\epsilon$ is independent of X and normally
distributed with mean 0 and standard deviation
0.1. Figure 1 contains the graphs of the CV and FSC
functions using the Gaussian kernel. Figure 2 contains
the data pairs, the true regression function, the NW estimator
with smoothing parameter selected by both CVP
and FSCP, and the OQ estimator selected by FSCP. Ten
percent trimming was used in (7) for these examples.
Abstract

We  present  an  efficient  network  algorithm  for  generating 
exact  permutational  distributions  for  linear  rank  tests 
defined  on  stratified  2  x  c  contingency  tables.  The  al¬ 
gorithm  can  evaluate  exact  one  and  two  sided  p-values, 
and  compute  exact  confidence  intervals  for  trend  param¬ 
eters  arising  from  certain  loglinear  and  logistic  models 
embedded  in  these  contingency  tables.  It  is  especially 
efficient  for  highly  imbalanced  categorical  data,  a  situ¬ 
ation  where  the  asymptotic  theory  is  unreliable.  Part 
of  the  algorithm  can  be  adapted  to  evaluating  the  con¬ 
ditional  maximum  likelihood  and  its  derivatives  for  the 
logistic  regression  model,  with  grouped  data.  We  illus¬ 
trate  the  techniques  with  an  analysis  of  two  data  sets;  the 
leukemia  data  on  the  Hiroshima  atomic  bomb  survivors, 
and  data  from  a  clinical  trial  of  bone  marrow  transplant. 


1  Introduction 

Linear  rank  tests  play  a  major  role  in  nonparametric  in¬ 
ference.  The  Chernoff-Savage  theorem  (1958)  ensures 
the  asymptotic  normality  of  these  tests,  and  indeed,  for 
continuous  data  the  asymptotic  results  work  very  well. 
By  the  time  the  sample  size  is  around  30,  there  is  very 
little  difference  between  the  asymptotic  distribution  of 
a  linear  rank  test  statistic  and  its  exact  permutational 
distribution.  However  this  is  not  the  case  for  categorical 
data. Here the rate of convergence to asymptotic normality
depends on more than just sample size. The number
of  ties  in  each  category,  the  group  imbalance,  and  the 
choice  of  rank  scores,  all  affect  the  shape  of  the  permuta¬ 
tion  distribution  in  complicated  ways,  making  it  difficult 
to  predict  a  priori  whether  the  asymptotic  results  for  a 
given data set are reliable. It is important therefore to
develop efficient numerical algorithms to supplement existing
asymptotic results for the categorical case. These
algorithms  serve  both  the  data  analyst  concerned  about 
the  validity  of  the  inference  in  small,  sparse,  or  imbal¬ 
anced  data  sets,  and  the  theoretical  statistician  develop¬ 
ing  new  asymptotic  methods  and  wishing  to  confirm  that 
the  theory  is  accurate. 

This  paper  develops  a  very  fast  algorithm  for  generating 
exact permutation distributions for linear rank tests defined
on stratified 2 x c contingency tables. The permutational
problem is formulated very precisely in Section 2.
A  network  algorithm  for  solving  the  problem  is  presented 
in  Section  3.  A  major  strength  of  the  algorithm  is  that  its 
limits  of  computational  feasibility  increase  with  the  de¬ 
gree  of  imbalance  between  the  groups  being  compared. 
This is precisely where it is needed most, since the reliability
of asymptotic results decreases as the imbalance
increases.  In  another  paper  we  analyze  some  case-control 
data  in  which  the  total  sample  size  is  99,960.  Yet,  be¬ 
cause  of  the  severe  imbalance  between  cases  and  controls, 
the  asymptotic  results  differ  from  the  exact  ones.  The 
algorithm  developed  here  performs  exact  permutational 
inference  on  the  data  set  with  no  difficulty  whatsoever, 
despite  its  enormous  sample  size. 

The  inference  techniques  discussed  in  this  paper  are  con¬ 
ditional.  This  is  true  both  for  the  exact  as  well  as  the 
asymptotic  inference.  Exact  methods  for  parameter  es¬ 
timation  naturally  require  strong  numerical  algorithms. 
But  it  is  not  generally  recognized  that  conditional  infer¬ 
ence  places  a  heavy  computational  burden  on  the  maxi¬ 
mum  likelihood  estimation  as  well.  A  by-product  of  the 
algorithmic  development  in  Section  3  is  its  applicability 
to  the  problem  of  estimating  model  parameters  by  max¬ 
imizing  a  conditional  likelihood  function  and  evaluating 
its first two derivatives. Without our algorithm, evaluating
the conditional likelihood, even though it only yields
asymptotic estimates, would be almost as difficult as the
exact inference.


2  Statistical  Formulation 

In  this  section  we  formulate  a  general  permutation  prob¬ 
lem  whose  solution  will  make  exact  statistical  inference 
possible  for  a  rich  class  of  linear  rank  tests,  defined  on  or¬ 
dered  categorical  or  binary  data.  The  computational  dif¬ 
ficulties  encountered  with  the  permutation  problem  are 
discussed,  setting  the  stage  for  the  development  of  an 
efficient  numerical  algorithm,  in  Section  3. 

2.1  Tabular  Representation  of  the  Data 

The data can be represented as a collection of s 2 x c
contingency tables, each consisting of 2 rows and c columns,
one table for each of the s strata. A specific collection, or
three way table, of this type, denoted by
$x = (x_1, x_2, \ldots, x_s)$, is displayed below:


Stratum 1

Rows        Col_1     Col_2     ...   Col_c     Row_Total
Row_1       x_{11}    x_{21}    ...   x_{c1}    m_1
Row_2       x'_{11}   x'_{21}   ...   x'_{c1}   m'_1
Col_Total   n_{11}    n_{21}    ...   n_{c1}    N_1
Col_Score   w_1       w_2       ...   w_c

Stratum 2

Rows        Col_1     Col_2     ...   Col_c     Row_Total
Row_1       x_{12}    x_{22}    ...   x_{c2}    m_2
Row_2       x'_{12}   x'_{22}   ...   x'_{c2}   m'_2
Col_Total   n_{12}    n_{22}    ...   n_{c2}    N_2
Col_Score   w_1       w_2       ...   w_c

Stratum s

Rows        Col_1     Col_2     ...   Col_c     Row_Total
Row_1       x_{1s}    x_{2s}    ...   x_{cs}    m_s
Row_2       x'_{1s}   x'_{2s}   ...   x'_{cs}   m'_s
Col_Total   n_{1s}    n_{2s}    ...   n_{cs}    N_s
Col_Score   w_1       w_2       ...   w_c

The  above  tabular  representation  accommodates  both 
the  comparison  of  two  multinomial  populations  and  the 
comparison of c binomial populations. In either case we
may  adjust  for  possible  covariate  effects  by  stratification. 
Unstratified  data  may  be  regarded  as  a  special  case  with 
s  =  1. 

Two Multinomial Populations  The two rows of
stratum k represent two independent multinomial
populations. Each observation falls into one of c
ordinal response categories. Thus $x_{jk}$ is the number of
stratum k observations, out of a total of $m_k$, falling
into ordered category j for population 1, and $x'_{jk}$ is
the number of stratum k observations, out of a total
of $m'_k$, falling into ordered category j for population 2.
The stratum invariant scores, $w_1, w_2, \ldots, w_c$,
are numerical values assigned to the c ordered
multinomial response categories.

Several Binomial Populations  The c columns of
stratum k represent c independent binomial populations,
with row 1 representing successes and row 2
representing failures. For population j and stratum k
there are $x_{jk}$ successes and $x'_{jk}$ failures in $n_{jk}$
independent Bernoulli trials. The stratum invariant
scores, $w_1, w_2, \ldots, w_c$, typically represent doses,
or levels of exposure, affecting the success rates of
the c binomial populations.

2.2  Exact  Conditional  Inference 

Define the reference set for the kth stratum, $\Gamma_k$, as all
possible 2 x c contingency tables whose row and column
margins are fixed at the corresponding values of the
observed 2 x c table, $x_k$:

$$\Gamma_k = \Big\{ y_k : y_k \text{ is } 2 \times c;\ y_{jk} + y'_{jk} = n_{jk},\ j = 1, \ldots, c;\ \sum_{j=1}^{c} y_{jk} = m_k;\ \sum_{j=1}^{c} y'_{jk} = m'_k \Big\} .$$

Define the full reference set as the cartesian product of
the reference sets across all s strata:

$$\Theta = \Gamma_1 \times \Gamma_2 \times \cdots \times \Gamma_s = \{ y : y_k \in \Gamma_k,\ k = 1, 2, \ldots, s \} .$$

The test statistic, T, is defined as a sum of linear rank
statistics over the s strata:

$$T = T_1 + T_2 + \cdots + T_s ,$$

where each $T_k$ can only take on values of the form

$$t_k = \sum_{j=1}^{c} w_j y_{jk}$$

for some $y_k \in \Gamma_k$, and a fixed set of scores, $w_1, w_2, \ldots, w_c$.
By a suitable choice of scores one can obtain a very rich
class of linear rank tests. The distribution of the test
statistic, T, is derived by limiting the sample space to
$y \in \Theta$.
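The definitions of the reference set and the test statistic can be made concrete by brute force on a tiny stratum. The sketch below is illustrative only: the function names are hypothetical, and the table weights are taken to be the hypergeometric probabilities of a 2 x c table with fixed margins (an assumption about the null distribution used here). It enumerates every first row y with the given column totals and row total, and accumulates the exact distribution of the stratum statistic.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import comb


def stratum_tables(col_totals, m):
    """Yield every first row y of a 2 x c table with column totals n_j and row-1 total m."""
    ranges = [range(min(n, m) + 1) for n in col_totals]
    for y in product(*ranges):
        if sum(y) == m:
            yield y


def stratum_distribution(col_totals, m, scores):
    """Exact null distribution of t = sum_j w_j * y_j by explicit enumeration."""
    total = comb(sum(col_totals), m)          # number of equally likely row-1 assignments
    dist = defaultdict(Fraction)
    for y in stratum_tables(col_totals, m):
        weight = 1
        for n, y_j in zip(col_totals, y):
            weight *= comb(n, y_j)            # ways to realise this table
        t = sum(w * y_j for w, y_j in zip(scores, y))
        dist[t] += Fraction(weight, total)
    return dict(dist)


if __name__ == "__main__":
    for t, prob in sorted(stratum_distribution([2, 2, 2], 3, [0, 1, 2]).items()):
        print(t, prob)
```

Explicit enumeration of this kind is only feasible for very small tables, which is the motivation for the network algorithm of Section 3.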

Under the null hypothesis of no row and column interaction
the conditional probability distribution of $T_k$ given
$y_k \in \Gamma_k$ is

$$f_k(t_k) = \sum_{y_k \in \Gamma_{k,t_k}} \frac{\prod_{j=1}^{c} \binom{n_{jk}}{y_{jk}}}{\binom{N_k}{m_k}} , \qquad (2.1)$$

where

$$\Gamma_{k,t_k} = \Big\{ y_k \in \Gamma_k : \sum_{j=1}^{c} w_j y_{jk} = t_k \Big\} .$$

Then by convolution, the conditional probability distribution
of T, given $y \in \Theta$, is

$$f(t) = \sum_{y \in \Theta_t} \prod_{k=1}^{s} \frac{\prod_{j=1}^{c} \binom{n_{jk}}{y_{jk}}}{\binom{N_k}{m_k}} , \qquad (2.2)$$

where

$$\Theta_t = \Big\{ y \in \Theta : \sum_{k=1}^{s} \sum_{j=1}^{c} w_j y_{jk} = t \Big\} .$$

Notice that (2.2) is a sum of generalized hypergeometric
probabilities and is free of all unknown parameters. This
enables us to compute exact p-values for all the linear
rank tests listed above. We can also compute the first two
moments of T and thereby perform asymptotic inference
by appealing to the Chernoff-Savage theorem.


2.3  Parameter Estimation

For data arising from two multinomial distributions or c
binomial distributions, we can specify loglinear and logistic
models, respectively, for the data generating process.
Let $\pi_{jk}$ be the probability that a subject from stratum k
is classified as falling into row 1 and column j. Let $\pi'_{jk}$
be the probability that a subject from stratum k is classified
as falling into row 2 and column j. If the two
rows of each stratum represent data from two multinomial
populations, the above probabilities must satisfy the
constraints

$$\sum_{j=1}^{c} \pi_{jk} = \sum_{j=1}^{c} \pi'_{jk} = 1 ,$$

for k = 1, 2, ..., s. If the c columns of each stratum represent
data from c binomial populations, the above probabilities
must satisfy the constraints

$$\pi_{jk} + \pi'_{jk} = 1 ,$$

for j = 1, 2, ..., c, and k = 1, 2, ..., s. In either case we
assume that there is no three-factor interaction so that
the c - 1 odds ratios

$$\psi_j = \frac{\pi_{jk}\, \pi'_{1k}}{\pi_{1k}\, \pi'_{jk}} ,$$

j = 2, 3, ..., c, do not depend on k. Next we model these
odds ratios as a function of the scores. If the data have
been generated from two stratified multinomial populations,
it is natural to derive the odds ratios from a loglinear
model with a linear by linear row times column
association (Agresti, 1990, page 275, equation (8.11)).
In the present context the linear by linear model specifies
the following expected cell counts on the logarithmic
scale:

$$\log(m_k \pi_{jk}) = \alpha_{jk} + \beta w_j$$

for row 1, and

$$\log(m'_k \pi'_{jk}) = \alpha_{jk}$$

for row 2.

If the data have been generated from c stratified binomial
populations it is natural to derive the odds ratios from a
logistic regression model (Cox, 1970):

$$\log \frac{\pi_{jk}}{\pi'_{jk}} = \alpha_k + \beta w_j .$$

Both models yield the relationship

$$\log \psi_j = \beta (w_j - w_1) , \qquad (2.3)$$

where $\beta$ is an unknown parameter to be estimated from
the data. It can be shown that T is a sufficient statistic
for $\beta$ under both the linear by linear association model
and the logistic regression model. Moreover, the conditional
distribution of T, given $(y_1, y_2, \ldots, y_s) \in \Theta$, depends
only on $\beta$, other (nuisance) parameters being eliminated
by the conditioning. This conditional distribution
is given by

$$f(t \mid \beta) = \frac{f(t) \exp(\beta t)}{\sum_{u} f(u) \exp(\beta u)} , \qquad (2.4)$$

where the denominator of equation (2.4) is simply the
normalizing constant obtained by summing over all possible
values of T. When $\beta = 0$ we obtain the null distribution
(2.2).
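The step from either model to relationship (2.3) is short enough to write out. The derivation below is a sketch under one stated assumption: the odds ratio $\psi_j$ is read as the odds of row 1 versus row 2 in column j, relative to the same odds in column 1.

$$
\begin{aligned}
\text{Loglinear model:}\quad
\log\frac{m_k \pi_{jk}}{m'_k \pi'_{jk}} &= (\alpha_{jk} + \beta w_j) - \alpha_{jk} = \beta w_j , \\
\log \psi_j &= \log\frac{\pi_{jk}}{\pi'_{jk}} - \log\frac{\pi_{1k}}{\pi'_{1k}}
 = \beta w_j - \beta w_1 = \beta (w_j - w_1) , \\[4pt]
\text{Logistic model:}\quad
\log \psi_j &= (\alpha_k + \beta w_j) - (\alpha_k + \beta w_1) = \beta (w_j - w_1) .
\end{aligned}
$$

In the loglinear case the factor $m_k / m'_k$ is the same for every column of stratum k, so it cancels when the two log odds are subtracted. In both cases the stratum-specific terms drop out, which is why the odds ratios do not depend on k.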
The conditional maximum likelihood estimate (cmle) of
$\beta$ is obtained by finding the value of $\beta$ that maximizes
the conditional probability (2.4) at the observed value
$T = a_0$. To obtain the variance of the cmle we need the
second derivative of the log likelihood, evaluated at the
cmle. Both the cmle and its variance may be rapidly
evaluated by repeated backward induction on a network,
as discussed in detail in Section 3. We can then use
these estimates to perform asymptotic hypothesis tests
or compute asymptotic confidence intervals for $\beta$.
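To illustrate what maximizing (2.4) involves once the coefficients f(t) are available, the following sketch applies Newton's method to the conditional log likelihood. It is not the network-based procedure of Section 3: it assumes the full null distribution is supplied as a dictionary, and the function names are hypothetical. It uses the standard exponential-family facts that the first derivative of the log likelihood is the observed value minus the mean of T under $f(t \mid \beta)$, and that the second derivative is minus the variance of T.

```python
from math import exp


def tilted_moments(dist, beta):
    """Mean and variance of T under f(t | beta); dist maps t -> f(t)."""
    shift = max(dist) if beta >= 0 else min(dist)   # keeps exp() from overflowing
    weights = {t: f * exp(beta * (t - shift)) for t, f in dist.items()}
    norm = sum(weights.values())
    mean = sum(t * w for t, w in weights.items()) / norm
    var = sum((t - mean) ** 2 * w for t, w in weights.items()) / norm
    return mean, var


def conditional_mle(dist, t_obs, tol=1e-12, max_iter=100):
    """Newton iterations for the cmle; returns (estimate, asymptotic variance)."""
    beta = 0.0
    for _ in range(max_iter):
        mean, var = tilted_moments(dist, beta)
        step = (t_obs - mean) / var      # score divided by observed information
        beta += step
        if abs(step) < tol:
            break
    _, var = tilted_moments(dist, beta)
    return beta, 1.0 / var               # variance = inverse information


if __name__ == "__main__":
    null_dist = {0: 0.25, 1: 0.5, 2: 0.25}
    print(conditional_mle(null_dist, 1.5))
```

The estimate is finite only when the observed value lies strictly between the smallest and largest attainable values of T; at either extreme the likelihood is monotone and the iteration diverges.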

To obtain an exact confidence interval for $\beta$ we need
the coefficients f(t) for all values of T in the tails of its
distribution. A network algorithm for this computation
is described in Section 3. Once these coefficients have
been computed, the conditional tail probabilities of
$T \ge a_0$ or $T \le a_0$, for any value of $\beta$, may be derived from
equation (2.4). Exact confidence bounds for $\beta$ are then
obtained by inverting corresponding UMP unbiased tests
for $\beta$, as shown in Cox (1970). For example, a
$100(1 - \alpha)\%$ lower confidence bound for $\beta$, say $\beta(a_0)$, would be
obtained as the solution to

$$\sum_{t = a_0}^{t_{\max}} f(t \mid \beta(a_0)) = \alpha . \qquad (2.5)$$

The solution to equation (2.5) may be rapidly evaluated
by a simple binary search because, as shown in (2.4), f(t)
and $\beta$ are separable in the expression for $f(t \mid \beta)$.
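The binary search for the lower bound in (2.5) can be sketched as follows. The sketch assumes the complete null distribution is available as a dictionary (so that the normalizing constant of (2.4) can be formed directly), and the search bracket of -50 to 50 is an arbitrary illustrative choice. The function names are hypothetical.

```python
from math import exp


def upper_tail(dist, a0, beta):
    """Pr(T >= a0 | beta) computed from (2.4); dist maps t -> f(t)."""
    shift = max(dist) if beta >= 0 else min(dist)   # avoids overflow in exp()
    num = sum(f * exp(beta * (t - shift)) for t, f in dist.items() if t >= a0)
    den = sum(f * exp(beta * (t - shift)) for t, f in dist.items())
    return num / den


def lower_confidence_bound(dist, a0, alpha, lo=-50.0, hi=50.0, tol=1e-10):
    """Solve upper_tail(beta) = alpha by bisection.

    The upper tail probability is increasing in beta, so the root is unique
    whenever a0 exceeds the smallest attainable value of T.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if upper_tail(dist, a0, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    null_dist = {0: 0.25, 1: 0.5, 2: 0.25}
    print(lower_confidence_bound(null_dist, 2, 0.05))
```

Because f(t) enters only as fixed coefficients, each evaluation inside the search costs one pass over the stored records and requires no further enumeration.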

2.4  Computational  Issues 

From the above discussion it is clear that a broad class
of exact linear rank tests and parameter estimates can
be obtained if we are able to compute truncated
distributions of the form

$$\Omega = \{ (t, f(t)) : t \ge a_0 \} . \qquad (2.6)$$

Exhaustive enumeration of all the tables in $\Theta$ for
generating $\Omega$ would be computationally explosive. Consider
the simple case of a single stratum, no ties, and
m = m' = N/2. The number of tables in the reference
set $\Theta$ for various values of N is


Sample Size (N)    Tables in Reference Set ($\Gamma$)
20                 1.8 x 10^5
30                 1.5 x 10^8
40                 1.4 x 10^11
50                 1.3 x 10^14
100                1.0 x 10^29
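The counts in this table can be checked directly. With a single stratum and no ties, each of the N columns has total 1, so a table is determined by which N/2 columns fall in row 1, and the reference set has "N choose N/2" members. A minimal sketch (the function name is hypothetical):

```python
from math import comb


def reference_set_size(n_total):
    """Number of 2 x N tables with unit column totals and both row totals N/2."""
    return comb(n_total, n_total // 2)


if __name__ == "__main__":
    for n_total in (20, 30, 40, 50, 100):
        print(n_total, f"{reference_set_size(n_total):.2e}")
```

With s strata of this size the count for the full reference set is the product of the stratum counts, which is what makes explicit enumeration infeasible so quickly.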

If there were s strata, the size of the corresponding
reference set would be raised to the sth power. It is clear
that even in the very powerful computing environment
available today, explicit enumeration of all the tables in
the reference set $\Theta$ rapidly becomes computationally
infeasible. However, much recent research, for example,
Mehta et al. (1984, 1985, 1988), Pagano and Tritchler
(1983), Tritchler (1984), Streitberg and Rohmel (1986),
and Hollander and Pena (1988), has focused on implicit
enumeration of the tables in $\Theta$, thereby considerably
extending the size of problem for which exact inference is
possible.

Mehta, Patel and Tsiatis (1984), and Mehta, Patel and
Wei (1988), developed a network algorithm for implicit
enumeration of all the 2 x c contingency tables in the
reference set $\Gamma$, defined for a single stratum (s = 1). Mehta,
Patel and Gray (1985) developed a network algorithm for
implicit enumeration of s 2 x 2 contingency tables (where
s > 1). The present paper generalizes the earlier work
to s independent 2 x c contingency tables, a considerably
more difficult problem. An alternative method would be
to treat the s 2 x c problem as a special case of conditional
logistic regression and directly use the exact algorithm
of Hirji, Mehta and Patel (1988). However, that
would not exploit the special structure of the problem
in the way that the present algorithm does. We conjecture
that the algorithm presented here is the fastest one
currently available for categorical data with unequally
spaced $w_j$ scores. In another paper we perform exact
inference on some rather large data sets, to illustrate how
powerful the algorithm is, and to set up a benchmark
against which competing algorithms may be evaluated.

A second contribution of this paper is to provide an efficient
numerical algorithm for computing the cmle for $\beta$
(equation 2.3) and its standard error. A previous algorithm
for this problem, in the more general conditional
logistic regression setting, was developed by Gail, Lubin,
and Rubenstein (1981). Our algorithm is equivalent
to theirs for data with no ties, but is considerably
more efficient for categorical data. In another paper, we
show that the Gail et al. algorithm, as implemented
in the EGRET (1988) software package, is unable to
compute conditional maximum likelihood estimates for
a large heavily tied data set, whereas our algorithm
obtains the required estimates very rapidly.


204  C.R. Mehta, N. Patel, and P. Senchaudhuri


3  Numerical  Algorithms 

We provide numerical algorithms for two problems:
generating the truncated permutation distribution $\Omega$,
defined by (2.6), and computing the cmle for $\beta$, say $\hat{\beta}$,
along with its standard error, $\hat{\sigma}$. Both problems are
solved within one unified framework wherein the reference
set $\Theta$ is represented as a network. We will see that
processing the network in the forward direction yields
$\Omega$, while processing the same network in the backward
direction yields $\hat{\beta}$ and its standard error.

3.1  Generating  an  Overall  Truncated 
Permutation  Distribution 

Our goal is to generate the truncated permutation
distribution $\Omega$ for T, the sum of linear rank statistics across
all the strata. Our strategy will be to generate s independent
stratum specific truncated permutation distributions
of the form

$$\Omega_k = \{ (t_k, f_k(t_k)) : t_k \ge a_k \} ,$$

at the cut-off points

$$a_k = a_0 - \sum_{l \ne k} t_{l,\max} ,$$

for k = 1, 2, ..., s. Here $t_{k,\max}$ is the maximum value of
the random variable $T_k$, and is easily evaluated as part
of the backward induction step discussed below. We will
perform pairwise convolutions on these stratum specific
distributions until the overall distribution is obtained.
Thus there are two steps to be performed repeatedly:
a distribution generation step, and a convolution step.
These steps are described next in separate subsections.
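The cut-off for a stratum follows from a simple bound: if every other stratum attains its maximum, stratum k must still contribute at least the remainder for the total to reach the observed value. A minimal sketch, with a hypothetical function name and the per-stratum maxima supplied as a list:

```python
def stratum_cutoffs(a0, t_max):
    """Cut-off a_k = a0 minus the sum of t_{l,max} over all strata l other than k."""
    total = sum(t_max)
    return [a0 - (total - t_k_max) for t_k_max in t_max]


if __name__ == "__main__":
    print(stratum_cutoffs(10, [5, 4, 3]))
```

A value of a stratum statistic below its cut-off can never be part of a combination whose sum reaches the observed value, so it is safe to discard it before any convolution takes place.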

3.1.1  Generating  Stratum  Specific  Truncated 
Permutation  Distributions 

Suppose we wish to generate the truncated permutation
distribution $\Omega_k$ for the kth stratum. In principle this
involves enumerating all the 2 x c contingency tables
$y_k \in \Gamma_k$, computing the value of $t_k = \sum_{j=1}^{c} w_j y_{jk}$ for
each one, and summing the hypergeometric probabilities
of all the tables $y_k \in \Gamma_{k,t_k}$, as shown in (2.1). We
do this enumeration implicitly rather than explicitly, by
representing the reference set $\Gamma_k$ as a network of nodes
and arcs, and then processing the network in a recursive
stagewise fashion.


Network Representation of $\Gamma_k$
The network representation of the reference set, $\Gamma_k$, is
constructed in c + 1 stages labelled 0, 1, ..., c, where stage
j corresponds to the jth column of a typical 2 x c table
in $\Gamma_k$. At stage j there exists a set of nodes of the form
$(j, m_{jk})$, where each $m_{jk} = \sum_{l=1}^{j} y_{lk}$ corresponds to one
distinct partial sum of the first j columns of the tables
$y_k \in \Gamma_k$. Arcs emanate from each node $(j, m_{jk})$ and
connect it to successor nodes of the form $(j+1, m_{j+1,k})$.
These successor nodes may be specified explicitly as the
set

$$R_k(j, m_{jk}) = \Big\{ (j+1, m_{j+1,k}) : \max\Big( m_{jk},\ m_k - \sum_{l=j+2}^{c} n_{lk} \Big) \le m_{j+1,k} \le \min( m_{jk} + n_{j+1,k},\ m_k ) \Big\} . \qquad (3.7)$$

Starting at stage 0 with initial node (0,0), and applying
(3.7) successively to the nodes at stages 1, 2, ..., c - 1,
we automatically end up with the unique terminal node
$(c, m_k)$. In this construction each path, or sequence of
connected arcs of the form

$$(0,0) \to (1, m_{1k}) \to \cdots \to (c, m_k) \qquad (3.8)$$

corresponds to one and only one table $y_k \in \Gamma_k$, with
$y_{jk} = m_{jk} - m_{j-1,k}$, for j = 1, 2, ..., c. Thus the tables
in $\Gamma_k$ are in one-to-one correspondence with the paths
through the network.
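The successor rule (3.7) and the correspondence between paths and tables can be checked on small examples. The sketch below is illustrative only (function names are hypothetical; `col_totals[l - 1]` holds the column total of column l and `m` is the row-1 total of one stratum). It builds the stages of the network and counts paths by propagating counts forward.

```python
def successors(j, m_j, col_totals, m):
    """Successor nodes of (j, m_j) according to (3.7)."""
    tail = sum(col_totals[j + 1:])            # n_{j+2} + ... + n_c
    lo = max(m_j, m - tail)
    hi = min(m_j + col_totals[j], m)          # col_totals[j] is n_{j+1}
    return [(j + 1, v) for v in range(lo, hi + 1)]


def build_network(col_totals, m):
    """Return a list of sets; entry j holds the partial sums present at stage j."""
    stages = [{0}]
    for j in range(len(col_totals)):
        nxt = set()
        for m_j in stages[j]:
            nxt.update(v for _, v in successors(j, m_j, col_totals, m))
        stages.append(nxt)
    return stages


def count_paths(col_totals, m):
    """Number of paths from (0, 0) to the terminal node, i.e. tables in the reference set."""
    counts = {0: 1}
    for j in range(len(col_totals)):
        nxt = {}
        for m_j, k in counts.items():
            for _, v in successors(j, m_j, col_totals, m):
                nxt[v] = nxt.get(v, 0) + k
        counts = nxt
    return counts.get(m, 0)


if __name__ == "__main__":
    print(build_network([2, 2, 2], 3))
    print(count_paths([2, 2, 2], 3))
```

No stage can hold more than min(m, N - m) + 1 nodes, whatever the number of tables, which is the source of the efficiency for imbalanced data.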

To complete the network representation we assign to
each arc

$$(j-1, m_{j-1,k}) \to (j, m_{jk})$$

a rank length

$$r_{jk} = w_j (m_{jk} - m_{j-1,k})$$

and a probability length $p_{jk}$.

The rank length of a complete path of the form (3.8)
connecting the initial node to the terminal node is defined
as the sum of rank lengths of the individual arcs constituting
that path. Its probability length is the product
of probability lengths of the individual arcs constituting
that path. The distribution of $T_k$ is then the same as the
distribution of rank lengths of all the paths in $\Gamma_k$.

Backward Induction on $\Gamma_k$
We can obtain much useful information about the
distribution of $T_k$ very quickly, by a single backward pass
through the network $\Gamma_k$. At any node $(j, m_{jk})$ define
the sub-network, $\Gamma_k(j, m_{jk})$, to be the set of all possible
paths from $(j, m_{jk})$ to the terminal node $(c, m_k)$. In
other words $\Gamma_k(j, m_{jk})$ consists of all possible values of
the entries in columns (j+1, j+2, ..., c) of the 2 x c
contingency tables in $\Gamma_k$ whose first j columns sum to $m_{jk}$.
Now define the length of the longest path in $\Gamma_k(j, m_{jk})$
by

$$LP(j, m_{jk}) = \max_{\Gamma_k(j, m_{jk})} \Big\{ \sum_{l=j+1}^{c} r_{lk} \Big\} , \qquad (3.10)$$

the length of the shortest path in $\Gamma_k(j, m_{jk})$ by

$$SP(j, m_{jk}) = \min_{\Gamma_k(j, m_{jk})} \Big\{ \sum_{l=j+1}^{c} r_{lk} \Big\} , \qquad (3.11)$$

and the sum of probability lengths of all the paths in
$\Gamma_k(j, m_{jk})$ by

$$TP(j, m_{jk}) = \sum_{\Gamma_k(j, m_{jk})} \prod_{l=j+1}^{c} p_{lk} . \qquad (3.12)$$

The values of LP, SP, and TP can be rapidly obtained
by backward induction. We illustrate how this
is done for LP. Set $LP(c, m_k) = 0$. Now suppose that
$LP(j+1, m_{j+1,k})$ is known for every node at stage j + 1.
Move backwards to stage j, select a node $(j, m_{jk})$, and
compute

$$LP(j, m_{jk}) = \max_{R_k(j, m_{jk})} \big\{ r_{j+1,k} + LP(j+1, m_{j+1,k}) \big\} . \qquad (3.13)$$

Repeat this process for every node at stage j and then
move back one more stage. Proceeding in this manner
we reach the initial node (0, 0) having evaluated the LP
values for all the nodes of the network. The other nodal
quantities may be obtained similarly.
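A single backward pass that fills in LP, SP and TP at every node can be sketched as below. Two points are assumptions made for the illustration rather than statements of the authors' implementation: the probability length of an arc is taken to be the binomial coefficient "n_{j+1,k} choose y_{j+1,k}", and nodes are generated from the feasible range of partial sums at each stage. The function name is hypothetical.

```python
from math import comb


def backward_pass(col_totals, m, scores):
    """Return dictionaries LP, SP, TP keyed by node (j, m_j) for one stratum."""
    c = len(col_totals)
    LP, SP, TP = {(c, m): 0}, {(c, m): 0}, {(c, m): 1}
    for j in range(c - 1, -1, -1):
        tail = sum(col_totals[j + 1:])               # n_{j+2} + ... + n_c
        n_next, w_next = col_totals[j], scores[j]    # n_{j+1} and w_{j+1}
        lo_node = max(0, m - sum(col_totals[j:]))    # feasible partial sums at stage j
        hi_node = min(m, sum(col_totals[:j]))
        for m_j in range(lo_node, hi_node + 1):
            arcs = []
            for v in range(max(m_j, m - tail), min(m_j + n_next, m) + 1):
                y = v - m_j                          # entry of column j + 1
                # assumed probability length: binomial coefficient
                arcs.append((w_next * y, comb(n_next, y), (j + 1, v)))
            LP[(j, m_j)] = max(r + LP[s] for r, _, s in arcs)
            SP[(j, m_j)] = min(r + SP[s] for r, _, s in arcs)
            TP[(j, m_j)] = sum(p * TP[s] for _, p, s in arcs)
    return LP, SP, TP


if __name__ == "__main__":
    LP, SP, TP = backward_pass([2, 2, 2], 3, [0, 1, 2])
    print(LP[(0, 0)], SP[(0, 0)], TP[(0, 0)])
```

Under the binomial-coefficient assumption the value of TP at the initial node equals "N choose m" for the stratum, and LP at the initial node is the maximum value of the stratum statistic.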


Processing $\Gamma_k$ in the Forward Direction
Starting with the initial node (0,0), we process the network
in the forward direction, stage by stage, in such a
way that by the time we reach the terminal node, $(c, m_k)$,
we will have generated the desired truncated distribution
$\Omega_k$. First we introduce some notation. At any node
$(j, m_{jk})$ define the sub-network, $\mathcal{T}_k(j, m_{jk})$, to be the
set of all possible paths from the starting node (0,0) to
$(j, m_{jk})$. In other words, $\mathcal{T}_k(j, m_{jk})$ consists of all possible
values of the entries in columns (1, 2, ..., j) of the
2 x c contingency tables in $\Gamma_k$ whose first j columns sum
to $m_{jk}$. (Notice that this set differs from $\Gamma_k(j, m_{jk})$,
which specifies the last c - j columns of these tables.)
Denote a generic path,

$$(0,0) \to (1, m_{1k}) \to \cdots \to (j, m_{jk})$$

in $\mathcal{T}_k(j, m_{jk})$ by $\tau$. The rank length of $\tau$ is

$$r(\tau) = \sum_{l=1}^{j} r_{lk} ,$$

and its probability length is

$$p(\tau) = \prod_{l=1}^{j} p_{lk} .$$

There will typically be several paths, $\tau \in \mathcal{T}_k(j, m_{jk})$,
each having the same rank length, $r(\tau) = u$. Let c(u) be
the sum of probability lengths of all these paths. That
is,

$$c(u) = \sum_{\tau :\, r(\tau) = u} p(\tau) .$$

We now provide a recursive procedure for processing
the network in the forward direction. Suppose we have
reached stage j of the network in such a way that at each
of its nodes, $(j, m_{jk})$, we are carrying a set of records

$$A(j, m_{jk}) = \{ (u, c(u)) : u = r(\tau),\ u + LP(j, m_{jk}) \ge a_k,\ \tau \in \mathcal{T}_k(j, m_{jk}) \} .$$

The following five-step algorithm is used to update these
sets and thereby move forward to stage j + 1.

Step 1: Select a record $(u, c(u)) \in A(j, m_{jk})$.

Step 2: Transmit a copy of this record to each successor
node $(j+1, m_{j+1,k})$, where the successors are
identified by (3.7).

Step 3: At each successor node, $(j+1, m_{j+1,k})$, transform
the transmitted record to $(u^*, c^*)$, where
$u^* = u + r_{j+1,k}$ and $c^* = c(u)\, p_{j+1,k}$.

Step 4: Insert $(u^*, c^*)$ into $A(j+1, m_{j+1,k})$ as follows:

1. If $u^* + LP(j+1, m_{j+1,k}) < a_k$, drop this record
from further consideration, and go to Step 5.
Otherwise continue with the insertion as described
below. (The value of LP is available
from the backward induction on $\Gamma_k$.)

2. If there already exists a record $(u, c(u)) \in A(j+1, m_{j+1,k})$
such that $u = u^*$, then merge
the two records by replacing $(u, c(u))$ with
$(u, c(u) + c^*) \in A(j+1, m_{j+1,k})$.

3. If no record currently in $A(j+1, m_{j+1,k})$ has
$u = u^*$, then augment $A(j+1, m_{j+1,k})$ by
adding $(u^*, c^*)$ to it, as a new record.

The technique of hashing (Sedgewick 1983, page
201) is used to search for matches and either merge
or augment records in $A(j+1, m_{j+1,k})$. This ensures
an optimum trade-off between efficient use of available
memory and fast search.

Step 5: Return to Step 1.


The above 5-step algorithm continues until every record
in $A(j, m_{jk})$ has been processed. Then another node at
stage j is selected, and all its records are processed in
accordance with the above 5 steps. When all nodes at
stage j have been exhausted, repeat Steps 1 through 5
for stage j + 1. Starting with $A(0,0) = \{(0, 1)\}$ and moving
through stages 0, 1, ..., c - 1 by repeatedly carrying
out Steps 1 through 5, we process the entire $\Gamma_k$ network,
ending up at its terminal node with the set of records
$A(c, m_k)$. These records are really the same as the desired
truncated probability distribution $\Omega_k$, except that
the probability lengths, c(u), have to be normalized by
dividing by their sum. That is,

$$f_k(t_k) = \frac{c(t_k)}{TP(0,0)} .$$
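The forward pass with record elimination can be sketched compactly, using a Python dictionary in place of an explicit hash table. As before, this is an illustration under stated assumptions and not the authors' code: arc probability lengths are taken to be binomial coefficients, normalization divides by "N choose m" (the sum of the probability lengths of all paths under that assumption), and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def truncated_distribution(col_totals, m, scores, a_k):
    """Return {t: f_k(t)} for all t >= a_k in one stratum."""
    c = len(col_totals)

    def stage_nodes(j):
        return range(max(0, m - sum(col_totals[j:])), min(m, sum(col_totals[:j])) + 1)

    def arcs(j, m_j):
        """Yield (successor partial sum, rank length, probability length)."""
        tail = sum(col_totals[j + 1:])
        for v in range(max(m_j, m - tail), min(m_j + col_totals[j], m) + 1):
            y = v - m_j
            yield v, scores[j] * y, comb(col_totals[j], y)

    # Backward induction: longest remaining rank length from every node.
    LP = {(c, m): 0}
    for j in range(c - 1, -1, -1):
        for m_j in stage_nodes(j):
            LP[(j, m_j)] = max(r + LP[(j + 1, v)] for v, r, _ in arcs(j, m_j))

    # Forward pass: records[m_j] maps rank length u -> c(u).
    records = {0: {0: 1}}
    for j in range(c):
        nxt = {}
        for m_j, recs in records.items():
            for v, r, p in arcs(j, m_j):
                for u, cu in recs.items():
                    u_star, c_star = u + r, cu * p
                    if u_star + LP[(j + 1, v)] < a_k:
                        continue                     # Step 4.1: can never reach a_k
                    bucket = nxt.setdefault(v, {})
                    bucket[u_star] = bucket.get(u_star, 0) + c_star  # Steps 4.2, 4.3
        records = nxt

    total = comb(sum(col_totals), m)
    return {t: Fraction(cu, total) for t, cu in records.get(m, {}).items()}


if __name__ == "__main__":
    print(truncated_distribution([2, 2, 2], 3, [0, 1, 2], 4))
```

Setting the cut-off at or below the minimum attainable value disables the elimination step and returns the complete distribution, which gives a convenient check against brute-force enumeration.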


3.1.2  Pairwise Convolution of the Stratum Specific
Truncated Distributions

We restrict our discussion to the convolution of $\Omega_1$ with
$\Omega_2$. The resultant distribution may be convolved with
$\Omega_3$ in exactly the same manner. We can go on with this
pairwise convolution until we obtain $\Omega$.

First sort the records of $\Omega_1$ in ascending order of $t_1$, and
the records of $\Omega_2$ in descending order of $t_2$. Set i = 1,
j = 1. Now proceed with the following 3-step algorithm:

Step 1: Select record i from $\Omega_1$. Denote it by
$(t_1, f_1(t_1))$. Select record j from $\Omega_2$. Denote it by
$(t_2, f_2(t_2))$.

Step 2: If

$$t_1 + t_2 + \sum_{k=3}^{s} t_{k,\max} \ge a_0 ,$$

set j = j + 1, and return to Step 1. But if

$$t_1 + t_2 + \sum_{k=3}^{s} t_{k,\max} < a_0 , \qquad (3.14)$$

convolve record i from $\Omega_1$ with each of the first j - 1
records from $\Omega_2$.

Step 3: Set i = i + 1, and return to Step 1.

There are many ways to perform the convolution at
Step 2, if the inequality (3.14) holds. We use hashing
to club records having the same value of $t_1 + t_2$. The
details are similar to Step 4.2 of the 5-step algorithm for
forward processing of $\Gamma_k$. A considerable efficiency gain
is achieved because we need not consider records from
$\Omega_2$ located at positions j or below. The inequality (3.14)
ensures that they can never contribute to the final set
of records in $\Omega$, since the maximum to which they could
be augmented is less than $a_0$. This is analogous to the
record elimination achieved at Step 4.1 of the 5-step
algorithm for forward processing of $\Gamma_k$.


3.2  Evaluating $\hat{\beta}$ and its Variance

To obtain the cmle for $\beta$, we must maximize the logarithm
of the likelihood (2.4). Then the second derivative
of the log likelihood, evaluated at $\hat{\beta}$, yields the desired
variance. But direct evaluation of the log likelihood is
not an easy task, given the complicated expression for
the denominator of (2.4). In fact if one attempted to
evaluate this denominator directly, it would require the
enumeration of all the s 2 x c tables in $\Theta$. This would
make the asymptotic inference as computationally complex
as the exact inference. Fortunately there is an easier
approach that works well up to extremely large sample
sizes. Notice that the denominator of (2.4) is the same as
TP(0,0), summed over all the strata. We can easily set
up recursions like (3.13) for TP, its first derivative, TP',
and its second derivative, TP'', and rapidly evaluate all
three quantities during the backward induction
of $\Gamma_k$. For example,

$$TP'(j, m_{jk}) = \sum_{R_k(j, m_{jk})} p_{j+1,k} \big[ TP(j+1, m_{j+1,k}) + TP'(j+1, m_{j+1,k}) \big] .$$

It is easy to show by successive differentiation of the
logarithm of (2.4) that the second derivative of the contribution
to the log likelihood of the kth stratum is

$$[TP(0,0)]^{-2} [TP'(0,0)]^{2} - [TP(0,0)]^{-1} [TP''(0,0)] . \qquad (3.15)$$

Evaluating (3.15) at the cmle of $\beta$, summing across
strata, and taking the negative reciprocal of the resultant
second derivative yields the desired asymptotic variance.
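The pairwise convolution of Section 3.1.2 can be sketched as follows. The sketch is illustrative: each truncated distribution is held as a dictionary from value to probability, `rest_max` stands for the sum of the maxima of the strata not yet convolved, and a dictionary lookup plays the role of hashing. The function name and argument names are hypothetical.

```python
def convolve_truncated(omega1, omega2, a0, rest_max=0):
    """Convolve two truncated distributions, keeping only sums that can still reach a0."""
    recs1 = sorted(omega1.items())                  # ascending t1
    recs2 = sorted(omega2.items(), reverse=True)    # descending t2
    out = {}
    j = 0   # number of leading records of recs2 that qualify for the current t1
    for t1, f1 in recs1:
        # Advance j while the pair can still reach a0 (the test of Step 2).
        while j < len(recs2) and t1 + recs2[j][0] + rest_max >= a0:
            j += 1
        # Records at position j or beyond satisfy (3.14) and are skipped.
        for t2, f2 in recs2[:j]:
            out[t1 + t2] = out.get(t1 + t2, 0) + f1 * f2
    return out


if __name__ == "__main__":
    half = {0: 0.5, 1: 0.5}
    print(convolve_truncated(half, half, a0=1))
```

Because the first list is ascending and the second descending, the pointer j only ever moves forward, so the qualifying prefix of the second list is found without rescanning it for each record of the first.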
4  Concluding Remarks

The  following  technical  features  of  the  network  algorithm 
were  responsible  for  its  extraordinary  success: 

•  The network representation takes advantage of the
categorical nature of the data by requiring only as
many stages as there are discrete categories.

•  The number of nodes in the T* network is determined
by $\min(m_{1k}, m_{2k})$. Thus the greater the imbalance
between the two row sums, the smaller the network,
and the easier the processing.

•  The preliminary backward induction pass through
the network provides valuable information about
the 'future' for each stage of the forward processing.
This enables us to generate a truncated permutation
distribution directly at the forward pass,
rather than generating the full permutation distribution
and then truncating it as needed. In effect,
substantially fewer records are carried along at each
stage of the forward pass, as records not satisfying
the LP criterion get eliminated.

•  The network representation enables us to generate
the distribution of each T* recursively in a stagewise
forward pass through the network. During this
forward pass, paths having the same rank length up
to some node are 'clubbed' together. We thus deal
only with paths having distinct rank lengths up to
each node, rather than all the paths up to that
particular node.

•  The backward induction step enables us to rapidly
evaluate the denominator of (2.4), and its first and
second derivatives. This greatly facilitates the
conditional maximum likelihood inference.
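
The 'clubbing' and truncation ideas in the list above can be illustrated with a small generic sketch. It is not the network algorithm itself: the input format (one dictionary of arc lengths and weights per stage), the function names, and the simple "cannot reach u0 any more" pruning rule are assumptions made for illustration only.

```python
from itertools import product


def upper_tail_weight(stages, u0):
    """Total weight of paths whose summed length is >= u0.

    stages: list of dicts mapping an arc length to its weight
    (hypothetical input format). A path's weight is the product
    of the weights of its arcs.
    """
    n = len(stages)
    # Backward pass: longest possible remaining length from stage k onward.
    max_future = [0] * (n + 1)
    for k in range(n - 1, -1, -1):
        max_future[k] = max_future[k + 1] + max(stages[k])

    # Forward pass: keep one record per distinct partial length ("clubbing").
    records = {0: 1}
    for k, stage in enumerate(stages):
        merged = {}
        for length, weight in records.items():
            for step, arc_weight in stage.items():
                new_length = length + step
                # Truncate: drop records that can no longer reach u0.
                if new_length + max_future[k + 1] < u0:
                    continue
                merged[new_length] = merged.get(new_length, 0) + weight * arc_weight
        records = merged
    return sum(records.values())


def brute_force(stages, u0):
    """Reference answer obtained by enumerating every path."""
    total = 0
    for path in product(*(s.items() for s in stages)):
        length = sum(step for step, _ in path)
        weight = 1
        for _, w in path:
            weight *= w
        if length >= u0:
            total += weight
    return total
```

Both functions return the same tail weight, but the forward pass carries at most one record per distinct partial length, and records that cannot reach the threshold are never generated.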
Abstract

We review the use of exact methods for checking logistic
regression models. We focus on global model checks,
outlier detection, and goodness-of-link checks. We discuss
approximations to exact conditional methods whenever
available. We also contrast exact conditional methods
and standard unconditional methods based on asymptotic
approximations. The techniques are applied to two
examples.

1  Introduction 

The generalized linear model (McCullagh and Nelder,
1989) provides a unified framework for analyzing binomial
response data. An attractive feature of this approach
is that a common collection of data analytic and
inferential techniques can be used for logistic, probit,
complementary log-log, and other possible link functions.
Maximum likelihood is usually used to fit generalized
linear models (GLIMs). For binomial response
models fit within the GLIM paradigm, inferences and
model assessments are based on large sample approximations.
For example, chi-squared approximations to
the deviance and Pearson statistics are used to assess
the global fit of the model, while normal approximations
to the deviance and Pearson residuals are used to check
for outliers.

The logistic regression model has a special place within
the class of binomial response models because it is a linear
exponential family model. Hence, exact conditional
methods for inference are available, in contrast to
unconditional methods based on maximum likelihood.

The use of conditional distributions for exact inference
on logistic parameters dates to Cox (1958, 1970). Hirji,
Mehta and Patel (1987) gave an efficient algorithm for
computing exact tests of logistic regression parameters.
Davison (1988) discussed saddlepoint expansions for
approximate conditional inference in logistic regression.

Exact conditional methods can also be used to check
the logistic model. The distribution of the data given
the observed value of the sufficient statistic for the logistic
model serves as the reference distribution for model
checks. Once the reference distribution is generated,
specific features can be assessed by an appropriate choice of
a test statistic. For example, the global fit of the model
can be based on a conditional assessment of the deviance
or Pearson statistic.

There are compelling reasons for basing model assessments
on the distribution of the data given the observed
value of the sufficient statistic. First, for any test of
model adequacy the conditional distribution is functionally
independent of the model parameters. In addition,
the conditional approach uses the exact discrete distribution
of the data, in contrast to methods that assume continuous
approximations to this distribution. For small
samples or sparse data sets this exactness can be critical.
McCullagh (1985, 1986) provided further support
for this position.

Bedrick and Hill (1990) developed an algorithm to
enumerate the reference distribution for checking logistic
models. The enumeration can be computationally
intensive, but is feasible for modestly sized data sets.

This paper reviews the use of exact methods for checking
logistic models. We focus on global model checks,
outlier detection, and goodness-of-link checks. We discuss
approximations to exact methods when they are
available. We also discuss the corresponding unconditional
methods. The rest of this paper is organized as
follows. Section 2 introduces notation and develops our
approach to model checking. Section 3 discusses exact
methods for the three areas of interest. Examples are
given in Section 4. Section 5 suggests directions for future
research and offers concluding remarks.
2  Background

2.1  Notation 

Assume that $Y = (Y_1, \ldots, Y_n)'$ is a vector of independent
binomial random variables with sample sizes $m_1, \ldots, m_n$
and probability vector $\pi = (\pi_1, \ldots, \pi_n)'$.
The mean vector and covariance matrix of Y are given by
$\mu = (\mu_1, \ldots, \mu_n)'$ and $V = \mathrm{diag}(v_1, \ldots, v_n)$, respectively,
where $\mu_i = m_i \pi_i$ and $v_i = m_i \pi_i (1 - \pi_i)$, $i = 1, \ldots, n$. The
logistic regression model can be expressed as

$$\mathrm{logit}(\pi_i) = \log\{\pi_i / (1 - \pi_i)\} = z_i' \beta ,$$

$i = 1, \ldots, n$, where $z_i'$ is a known 1 x p vector of covariates,
and $\beta$ is a p x 1 vector of unknown regression parameters.
Using matrix notation,

$$\mathrm{logit}(\pi) = Z \beta \qquad (1)$$

where Z is an n x p full rank design matrix with i-th row
$z_i'$. Under model (1), $S = Z'Y$ is sufficient for $\beta$. Let $\hat\beta$
be the maximum likelihood estimator (MLE) of $\beta$ under
model (1). Similar notation is used for other MLEs under
model (1); for example, $\hat\mu$ is the MLE of the mean vector.
Finally, set $h_i = \hat v_i z_i' (Z' \hat V Z)^{-1} z_i$, $i = 1, \ldots, n$.

The Pearson and deviance statistics on n - p degrees of
freedom are given by $X^2 = \sum_i \chi_i^2$ and $D = \sum_i d_i$, where

$$\chi_i^2 = (y_i - \hat\mu_i)^2 / \hat v_i$$

and

$$d_i = 2 \, [\, y_i \log(y_i / \hat\mu_i) + (m_i - y_i) \log\{(m_i - y_i) / (m_i - \hat\mu_i)\} \,] .$$

2.2  General comments on model checking

The distribution of the data $\mathrm{pr}(y; \beta)$, indexed by $\beta$, can
be factored into the marginal distribution of the sufficient
statistic S, and the conditional distribution of the
data given the sufficient statistic:

$$\mathrm{pr}(y; \beta) = \mathrm{pr}(y \mid S) \, \mathrm{pr}(S; \beta) .$$

Taking a Fisherian approach (Fisher, 1950), inferences
about $\beta$ are based on $\mathrm{pr}(S; \beta)$, while model checks use
$\mathrm{pr}(y \mid S)$. Letting $S_{obs} = Z' y_{obs}$ be the observed value of
the sufficient statistic for the logistic model, the reference
distribution for model checking is

$$\mathrm{pr}(Y = y \mid S = S_{obs}) = C_{obs}^{-1} \prod_{i=1}^{n} \binom{m_i}{y_i} , \qquad (2)$$

where

$$C_{obs} = \sum_{y^* \in \mathcal{Y}_{obs}} \prod_{i=1}^{n} \binom{m_i}{y_i^*}$$

and

$$\mathcal{Y}_{obs} = \{ y^* = (y_1^*, \ldots, y_n^*)' : y_i^* \text{ an integer}, \; 0 \le y_i^* \le m_i , \text{ and } Z' y^* = S_{obs} \} .$$

Note that $\mathcal{Y}_{obs}$ is the set of response vectors that give
the same value of the sufficient statistic as the observed
data.

Specific features of interest are assessed by the appropriate
choice of a test statistic, say $t(Y)$. Assuming
for the moment that large values of $t(y)$ call into question
the adequacy of the model, the significance level
associated with the observed value of the test statistic,
$t_{obs} = t(y_{obs})$, is given by

$$p(t_{obs}) = \mathrm{pr}\{ t(Y) \ge t_{obs} \mid S = S_{obs} \} .$$

The choice of a test statistic need not imply that a
particular alternative model is of interest. Indeed, although
the statistic created to assess a feature of the
model might be motivated by consideration of a particular
alternative model, the conclusions drawn from such
a test are provisional and do not require acceptance of
that alternative. In this regard, the evaluation of $p(t_{obs})$
is a pure significance test. We are not testing formal
hypotheses. The computational aspects are, however,
identical to those used for exact conditional tests.

Although the reference distribution (2) looks seductively
simple, the elements of $\mathcal{Y}_{obs}$ usually must be enumerated
to check the model. Depending on the number
of samples n and the configuration of covariates, this
enumeration can be computationally intensive (Bedrick
and Hill, 1990). Once $\mathcal{Y}_{obs}$ is generated, however,
implementation of many model checks is routine.

3  Conditional Methods for Model Checking

3.1  Global model checks

A global evaluation of the model can be based on the
conditional probability of the data, $q(y) = \mathrm{pr}(Y = y \mid S = S_{obs})$.
The p-value for $q(y)$ is the sum of the conditional
probabilities for y-vectors that are at least as rare
as the observed vector $y_{obs}$, that is,
$p(q_{obs}) = \mathrm{pr}\{ q(Y) \le q(y_{obs}) \mid S = S_{obs} \}$.
Alternative tests are based on conditional
assessments of the deviance and Pearson statistics.
P-values for these statistics are $p(D_{obs})$ and $p(X^2_{obs})$; for
example, $p(D_{obs}) = \mathrm{pr}(D \ge D_{obs} \mid S = S_{obs})$. Each
vector in $\mathcal{Y}_{obs}$ has the same fitted values for the logistic
model (1). Thus, once $\mathcal{Y}_{obs}$ is stored, these three p-values
are easily calculated.
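
As an illustration of these three p-values, the following is a minimal brute-force sketch, not the Bedrick and Hill (1990) algorithm. It assumes integer-valued covariates so that the constraint on the sufficient statistic can be tested exactly, it takes the logistic fitted means `mu` as given, and all function names are hypothetical. It scans every vector with entries between 0 and m_i, so it is only usable for very small problems.

```python
from fractions import Fraction
from itertools import product
from math import comb, log


def reference_set(Z, m, y_obs):
    """All integer vectors y with 0 <= y_i <= m_i and Z'y = Z'y_obs.

    Z is a list of n covariate rows with integer entries (an assumption
    that keeps the equality test exact).
    """
    p = len(Z[0])

    def suff(y):
        return tuple(sum(Z[i][k] * y[i] for i in range(len(y))) for k in range(p))

    s_obs = suff(y_obs)
    return [y for y in product(*(range(mi + 1) for mi in m)) if suff(y) == s_obs]


def weight(y, m):
    """Unnormalised conditional probability: product of binomial coefficients."""
    w = 1
    for yi, mi in zip(y, m):
        w *= comb(mi, yi)
    return w


def deviance(y, m, mu):
    """Binomial deviance D, with the convention 0 * log(0) = 0."""
    def term(a, b):
        return a * log(a / b) if a > 0 else 0.0
    return 2 * sum(term(yi, ui) + term(mi - yi, mi - ui)
                   for yi, mi, ui in zip(y, m, mu))


def pearson(y, m, mu):
    """Pearson statistic X^2 = sum of (y_i - mu_i)^2 / v_i."""
    return sum((yi - ui) ** 2 / (ui * (1 - ui / mi))
               for yi, mi, ui in zip(y, m, mu))


def exact_p_values(Z, m, y_obs, mu):
    """Exact conditional p-values for q, D and X^2 (mu = logistic fitted means)."""
    ys = reference_set(Z, m, y_obs)
    w = {y: weight(y, m) for y in ys}
    total = sum(w.values())
    y_obs = tuple(y_obs)
    tol = 1e-9  # guards the floating-point comparisons of D and X^2
    d_obs, x_obs = deviance(y_obs, m, mu), pearson(y_obs, m, mu)
    p_q = Fraction(sum(v for v in w.values() if v <= w[y_obs]), total)
    p_d = Fraction(sum(w[y] for y in ys if deviance(y, m, mu) >= d_obs - tol), total)
    p_x = Fraction(sum(w[y] for y in ys if pearson(y, m, mu) >= x_obs - tol), total)
    return p_q, p_d, p_x
```

Because every vector in the reference set shares the same fitted values, `mu` is computed once from the observed data and reused for every candidate vector.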

McCullagh (1985, 1986) developed Edgeworth approximations
to the conditional significance levels for D and
$X^2$. He derived both a normal approximation and a
second-order skewness correction to $p(D_{obs})$ and $p(X^2_{obs})$.
The approximations are relatively easy to program. We
refer the interested reader to McCullagh's papers for
the corresponding formulae. The Edgeworth approximations
assume that the number of samples n is large,
but they do not require that the sample sizes $m_i$ are
large. These approximations are ideal for studies involving
many small binomial samples because exact evaluations
are often infeasible and the standard unconditional
chi-squared approximations to D and $X^2$ assume that
each sample size is large.

McCullagh (1985) conducted a small empirical study
of the approximations to $p(X^2_{obs})$ for sparse data problems.
He concluded that the normal approximation was
inadequate and that the skewness correction gave better
results. To the best of our knowledge, there have been no
studies of the accuracy of the Edgeworth approximations
to $p(D_{obs})$.

For non-replicated binary data (i.e. $m_i = 1$ for all
i) the deviance is identical for each observation in $\mathcal{Y}_{obs}$
(McCullagh, 1986). Moreover, each observation in this
set has the same probability. Consequently, there is little
information in non-replicated binary data concerning the
global fit of the model. The diagnostic power of global
tests is also likely to be limited when all the sample sizes
are very small. In these situations, specific model checks
need to be formulated.

3.2  Single  degree  of  freedom  checks 

Many model checks can be formulated as an exact test on
a single logistic regression parameter. To develop these
model checks, we consider testing $\gamma = 0$ in the model

$$\mathrm{logit}(\pi) = Z \beta + w \gamma , \qquad (3)$$

where w is a known n x 1 vector. The sufficient statistics
under model (3) are $R = w'Y$ and $S = Z'Y$.
One-sided significance levels for alternatives $\gamma > 0$ and
$\gamma < 0$ are given by $p_U(r_{obs}) = \mathrm{pr}(R \ge r_{obs} \mid S = S_{obs})$
and $p_L(r_{obs}) = \mathrm{pr}(R \le r_{obs} \mid S = S_{obs})$, respectively.
A two-sided significance level is often defined to
be $2 \min\{ p_L(r_{obs}), p_U(r_{obs}) \}$ (Cox, 1970). Hirji et al.'s
(1987) algorithm can be used to evaluate these tail
probabilities.
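
For very small problems the tail probabilities can be checked by direct enumeration. The sketch below is a brute-force illustration, not the Hirji et al. (1987) algorithm; it assumes integer-valued entries in Z and w so that the sufficient statistics can be compared exactly, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def conditional_tail_probabilities(Z, w, m, y_obs):
    """Exact p_L, p_U and 2*min(p_L, p_U) for R = w'Y given S = Z'Y.

    Brute-force enumeration; the rows of Z and the entries of w are
    assumed to be integers.
    """
    n, p = len(m), len(Z[0])

    def suff(y):
        return tuple(sum(Z[i][k] * y[i] for i in range(n)) for k in range(p))

    def r_stat(y):
        return sum(wi * yi for wi, yi in zip(w, y))

    s_obs, r_obs = suff(y_obs), r_stat(y_obs)
    dist = {}  # unnormalised conditional distribution of R
    for y in product(*(range(mi + 1) for mi in m)):
        if suff(y) == s_obs:
            weight = 1
            for yi, mi in zip(y, m):
                weight *= comb(mi, yi)
            dist[r_stat(y)] = dist.get(r_stat(y), 0) + weight
    total = sum(dist.values())
    p_upper = Fraction(sum(v for r, v in dist.items() if r >= r_obs), total)
    p_lower = Fraction(sum(v for r, v in dist.items() if r <= r_obs), total)
    return p_lower, p_upper, 2 * min(p_lower, p_upper)
```

The two-sided level is returned exactly as defined in the text, so it is not capped at one.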

To emphasize our earlier comments on significance
testing versus hypothesis testing, consider the problem
of testing for an increasing trend in the probabilities
$\pi_i$, $i = 1, \ldots, n$. The usual small sample test (Gart et
al., 1986; p. 85) uses the conditional distribution of
$R = \sum_i w_i Y_i$ given $S = \sum_i Y_i$, where $w_1 < \cdots < w_n$ are
a somewhat arbitrarily preassigned set of scores. Large
values of R suggest an increasing trend. This test is formally
equivalent to an upper one-sided exact test of zero
slope in the model $\mathrm{logit}(\pi_i) = \alpha + w_i \gamma$, $i = 1, \ldots, n$. One
would likely not accept this as an alternative model if the
data suggested that the null model of equal probabilities
was implausible.

Davison (1988) derived double saddlepoint approximations
to the tail probabilities $p_U(r_{obs})$ and $p_L(r_{obs})$.
For simplicity, we will consider the upper tail approximation.
Let $Z_w = [Z, w]$ be the full model (3) design
matrix, and define $\hat\gamma$, $D_w$, and $\hat V_w$ to be, respectively, the
MLE of $\gamma$, the deviance, and the estimated covariance
matrix of Y under this model. The double saddlepoint
approximation of $p_U(r_{obs})$ is

$$p_U(r_{obs}) \approx 1 - \Phi(x^*) + \phi(x^*) (1/c^* - 1/x^*) , \qquad (4)$$

where $x^* = \mathrm{sign}(\hat\gamma) (D - D_w)^{1/2}$ and
$c^* = \{1 - \exp(-\hat\gamma)\} \{ \det(Z_w' \hat V_w Z_w) / \det(Z' \hat V Z) \}^{1/2}$.
Here $\Phi(\cdot)$ and $\phi(\cdot)$ are
the standard normal distribution function and density
function, respectively. The lead term in (4) is the normal
approximation to the signed square root of the drop in
deviance. This is often used for a large sample test of
$\gamma = 0$. Davison also discussed the use of a continuity
correction.
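
Once the two models have been fitted, formula (4) is a one-line calculation. The sketch below evaluates it from quantities assumed to be available from the fits; the function and argument names are hypothetical, and the continuity correction is not included.

```python
from math import erf, exp, pi, sqrt


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def normal_pdf(x):
    """Standard normal density function."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def saddlepoint_upper_tail(gamma_hat, dev_null, dev_full, det_full, det_null):
    """Approximation (4) to p_U(r_obs).

    gamma_hat : MLE of gamma under model (3)
    dev_null  : deviance D under model (1)
    dev_full  : deviance D_w under model (3)
    det_full  : det(Z_w' V_w Z_w), evaluated at the model (3) fit
    det_null  : det(Z' V Z), evaluated at the model (1) fit
    """
    if gamma_hat == 0:
        raise ValueError("formula (4) is undefined when gamma_hat is zero")
    sign = 1.0 if gamma_hat > 0 else -1.0
    x_star = sign * sqrt(dev_null - dev_full)
    c_star = (1.0 - exp(-gamma_hat)) * sqrt(det_full / det_null)
    return 1.0 - normal_cdf(x_star) + normal_pdf(x_star) * (1.0 / c_star - 1.0 / x_star)
```

The first two terms alone give the normal approximation based on the signed square root of the drop in deviance; the last term is the saddlepoint adjustment.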

Bedrick and Hill (1991) evaluated the accuracy of
double saddlepoint approximations to the tail probabilities
$p_U(r_{obs})$ and $p_L(r_{obs})$ for several well-known
conditional tests. The saddlepoint approximations were
extremely accurate, except when the data were sparse
or the design matrix was highly unbalanced. Moreover,
these approximations were superior to Edgeworth
approximations. In our experience, the saddlepoint
approximations are sufficient for most assessments.

3.3  Local  deviations:  Outlier  detection 

Pregibon (1981, 1982) developed unconditional methods
to detect outliers in logistic regression data, assuming a
mean slippage model. This outlier model allows a separate
mean for a potentially outlying observation. For
the moment, we consider methods to detect a single outlier
at a designated observation, say j. Letting $e_j$ be the
n x 1 indicator variable with jth element equal to one
and all other elements equal to zero, the outlier model is

$$\mathrm{logit}(\pi) = Z \beta + e_j \gamma . \qquad (5)$$

A test that the jth observation is an outlier is found by
testing $\gamma = 0$.

To test $\gamma = 0$, Pregibon suggested the score statistic
$t_j^2 = \chi_j^2 / (1 - h_j)$ and the drop in deviance $\Delta_j = D - D_j$,
where $D_j$ is the deviance from the outlier model
(5). He found a one-step approximation to $\Delta_j$ that does
not require estimated probabilities for the outlier model,
$\tilde\Delta_j = d_j / (1 - h_j)$. Note that $t_j^2$ and $\tilde\Delta_j$ are, respectively,
standardized squared Pearson and deviance residuals.

The usual $\chi^2_1$ approximations for these statistics are
questionable for small sample sizes. The inappropriateness
of the $\chi^2_1$ approximation for binary data is clear
because the residuals have two-point distributions
(Jennings, 1986).

Bedrick and Hill (1990) discussed several exact conditional
tests for a single outlier. Following the development
in Section 3.2, the statistics $Y_j$ and S are
sufficient under the outlier model (5). Thus, significance
levels for one-sided alternatives are given by
$p_U(y_{j,obs}) = \mathrm{pr}(Y_j \ge y_{j,obs} \mid S = S_{obs})$ and
$p_L(y_{j,obs}) = \mathrm{pr}(Y_j \le y_{j,obs} \mid S = S_{obs})$, while a two-sided
significance level is $2 \min\{ p_L(y_{j,obs}), p_U(y_{j,obs}) \}$.

Another test is based on small values of the conditional
probability $q_j(y_j) = \mathrm{pr}(Y_j = y_j \mid S = S_{obs})$. The
corresponding p-value is
$p(q_{j,obs}) = \mathrm{pr}(q_j(Y_j) \le q_j(y_{j,obs}) \mid S = S_{obs})$.
Alternatively, we can use conditional assessments
of $t_j^2$, $\Delta_j$, or $\tilde\Delta_j$.

All of the test statistics considered here depend on Y
only through $Y_j$ and S. Thus, the reference distribution
for each of the tests is

$$\mathrm{pr}(Y_j = y_j \mid S = S_{obs}) , \qquad (6)$$

which can be derived from the reference distribution (2).
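
The reduction from (2) to (6) amounts to summing the weights of all reference vectors that share the same value of $y_j$. A minimal brute-force sketch follows; it assumes integer-valued covariates and a zero-based index `j`, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def outlier_reference_distribution(Z, m, y_obs, j):
    """Distribution (6), pr(Y_j = y_j | S = S_obs), by brute-force enumeration.

    j is a zero-based index; the rows of Z are assumed to hold integers.
    """
    n, p = len(m), len(Z[0])

    def suff(y):
        return tuple(sum(Z[i][k] * y[i] for i in range(n)) for k in range(p))

    s_obs = suff(y_obs)
    marginal = {}
    for y in product(*(range(mi + 1) for mi in m)):
        if suff(y) == s_obs:
            weight = 1
            for yi, mi in zip(y, m):
                weight *= comb(mi, yi)
            # Collapse distribution (2) onto the j-th coordinate.
            marginal[y[j]] = marginal.get(y[j], 0) + weight
    total = sum(marginal.values())
    return {yj: Fraction(v, total) for yj, v in marginal.items()}


def probability_test(dist, yj_obs):
    """p(q_j,obs): total probability of values no more likely than the observed one."""
    return sum(q for q in dist.values() if q <= dist[yj_obs])
```

The one-sided levels follow from the same dictionary by summing over values at or above (or at or below) the observed $y_j$.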

Conflicting inferences from the different statistics are
possible because the statistics measure extremeness in
$Y_j$ differently. This makes the choice of test statistic an
important issue, which Bedrick and Hill (1990) addressed
in detail. For example, they recommended recentering
the statistics $t_j^2$ and $\Delta_j$ at the conditional mean of $Y_j$
when the reference distribution (6) is multimodal or
extremely skewed. In such cases, the recentered statistics,
which approximate tests based on the conditional likelihood
for $\gamma$, behave like the test based on the conditional
probability $q_j$.

We view the outlier test as a check on whether $y_{j,obs}$
is inconsistent with the model, without reference to an
alternative. Consequently, we prefer the probability test
statistic $q_j$ to the other statistics.

The saddlepoint approximation (4) based on fitting
the outlier model (5) gives estimates of $p_U(y_j)$ and
$p_L(y_j)$ for all $y_j$. These estimates can be used to
approximate the reference distribution (6), and the exact
distribution of each of the test statistics.

When the location of the outlier is unknown, Bedrick
and Hill (1990) recommended that the minimum
p-value (for a given statistic) be used to indicate which
case might be an outlier. Evaluating the p-value of this
extreme p-value statistic requires the entire conditional
distribution (2). In addition, they suggested using a plot
of the ordered p-values versus their conditional expected
values together with upper and lower bounds. The
p-value plot is a natural analog to standard Q-Q plots.
They also discussed the problem of detecting multiple
outliers.

3.4  Goodness-of-link  tests 

The appropriateness of the logistic link function can be
assessed in several ways. For simplicity, suppose that we
are interested in checking whether a specific alternative
link function, say the probit, provides a better fit to the
data. Assume that the same covariates are used with
both links. Let $\hat\mu^A$ and $D_A$ be the estimated mean vector
and the deviance under the alternative link. The
discrepancy between the two fitted models can be measured
by the difference in deviances, $\Delta_A = D - D_A$. This test
statistic is minus twice the (unconditional) log-likelihood
ratio statistic for comparing non-nested models. Large
positive values of $\Delta_A$ suggest a departure from the logistic
model in the direction of the alternative link. A
conditional assessment of $\Delta_A$ requires that $D_A$ be computed
for each vector in $\mathcal{Y}_{obs}$.

The comparison of non-nested models was initially
studied by Cox (1961), who developed large sample
unconditional tests based on the likelihood ratio. Wahrendorf,
Becher, and Brown (1987) proposed the difference
in deviances for comparing non-nested generalized linear
models. They assessed the significance of the difference
in deviances unconditionally, using nonparametric bootstrap
samples.

An  alternative  approach  is  to  imbed  the  logistic  model 
within  a  parametric  family  of  link  functions.  Davison 
(1988)  showed  how  the  saddlepoint  approximation  could 
be  used  to  assess  the  adequacy  of  the  logistic  link  within 
this  framework.  We  refer  the  interested  reader  to  his 
paper  for  details. 

4  Examples 

The first example examines the accuracy of approximations
to exact conditional methods. The second example
illustrates a goodness-of-link check. We used Bedrick
and Hill's (1990) algorithm to generate $\mathcal{Y}_{obs}$ for these
examples.

4.1  Nodal  involvement  data 

Brown (1980) discussed an experiment where 53 prostate
cancer patients underwent surgery to examine their
lymph nodes for evidence of cancer. The data were used
to develop a model for predicting nodal involvement (1
= evidence of cancer, 0 = no evidence) from 5 preoperative
binary prognostic variables. The data are given
in Table 1; see Bedrick and Hill (1990) for a description
of the covariates. All of the 23 samples are small, so
asymptotic theory does not apply to residuals or outlier
tests.


Table 1: Nodal involvement data with designated case
test statistics and conditional p-values. The covariates
z_j' are given as a binary string of length 5.

 j   y_j/m_j   z_j'    t_j^2   p(t_j^2)   Delta_j   p(Delta_j)
 1   5/6       01111   1.54    0.39       1.11      1.00
 2   1/6       00001   0.08    1.00       0.08      1.00
 3   0/4       11100   2.81    0.39       4.86      0.21
 4   2/4       11001   0.22    1.00       0.22      1.00
 5   0/4       00000   0.22    1.00       0.43      1.00
 6   2/3       01101   0.03    1.00       0.03      1.00
 7   1/3       11000   1.73    0.35       1.24      0.35
 8   0/3       10001   0.75    1.00       1.37      0.61
 9   0/3       10000   0.11    1.00       0.22      1.00
10   0/2       10010   0.58    1.00       1.06      1.00
11   1/2       01001   0.00    1.00       0.00      1.00
12   1/2       00100   4.44    0.18       2.54      0.18
13   1/1       11111   0.10    1.00       0.20      1.00
14   1/1       11011   0.27    1.00       0.48      1.00
15   1/1       10111   0.53    1.00       0.90      1.00
16   1/1       10011   1.15    1.00       1.61      1.00
17   0/1       10100   0.09    1.00       0.17      1.00
18   1/1       01110   0.47    1.00       0.80      1.00
19   0/1       01100   0.52    1.00       0.86      1.00
20   1/1       01010   1.36    1.00       1.94      1.00
21   1/1       00101   2.14    0.35       2.51      0.35
22   0/1       00011   1.83    0.41       2.24      0.41
23   0/1       00010   0.34    1.00       0.60      1.00

Brown proposed a main effects model for the log-odds
of nodal involvement:

$$\mathrm{logit}(\pi_i) = \beta_0 + \beta_1 z_{i1} + \beta_2 z_{i2} + \beta_3 z_{i3} + \beta_4 z_{i4} + \beta_5 z_{i5} ,$$

$i = 1, \ldots, 23$. A FORTRAN version of our algorithm
generated the 6034 response vectors belonging to $\mathcal{Y}_{obs}$ in
13 seconds on a SUN SPARCstation IPC computer.

The deviance and Pearson statistics for this model are
$D_{obs} = 18.07$ and $X^2_{obs}$ = [illegible].46 on 17 degrees of freedom.
The exact conditional p-values for $q(y_{obs})$, $D_{obs}$, and
$X^2_{obs}$ are 0.758, 0.803, and 0.792, respectively, suggesting
that the model provides an adequate global fit to the
data. The Edgeworth approximations to the conditional
p-values for $D_{obs}$ and $X^2_{obs}$ lead to the same conclusions.
The first-order normal approximations to $p(D_{obs})$
and $p(X^2_{obs})$ are 0.937 and 0.912, while the second-order
skewness adjusted approximations are 0.846 and 0.828.
The second-order approximations are reasonably accurate.

Table 1 gives the score statistic $t_j^2$, and its exact conditional
p-value $p(t_j^2)$, for each observation. Summaries
for $\Delta_j$ are also provided. We note that $p(t_j^2) = p(q_j)$ for
each observation. None of the observations appears to
be unusual. Note the consistency in the p-values across
statistics, even when the magnitudes of $t_j^2$ and $\Delta_j$ are
very different. Recentering $t_j^2$ and $\Delta_j$ at the conditional
expectation of $Y_j$ had little effect on the conditional
p-values.

The saddlepoint approximations to the marginal conditional
distributions $\mathrm{pr}(Y_j = y_j \mid S = S_{obs})$ were very
accurate. The relative error in the continuity corrected
estimates averaged 3% across the samples. Moreover,
the approximations to the significance levels for $t_j^2$, $\Delta_j$,
and $q_j$ were always within 0.01 of the exact values given
in Table 1.

The exact conditional and approximate conditional assessments
indicate that the logistic model provides an
adequate global and local fit to the data.

4.2  A  dose-response  experiment 

Table 2 gives data from an experiment designed to examine
the toxicity of a pesticide to a species of Chrysanthemum
aphis (Finney, 1947, p. 69). We initially considered
a logistic model with log-dose as the predictor.
The deviance and Pearson statistics are $D_{obs} = 5.96$ and
$X^2_{obs}$ = [illegible] on 4 degrees of freedom. The asymptotic
p-values for these statistics based on a $\chi^2$ approximation
are both about 0.20.


Table 2: Dose-response data (Finney, 1947; p. 69) with
expected counts for logistic (mu_j) and complementary
log-log (mu_j^A) links.

 j   m_j   y_j   Log-dose   mu_j    mu_j^A
 1   47     7    0.40        7.18    8.64
 2   46    22    0.71       18.47   17.36
 3   46    27    1.00       32.02   29.85
 4   48    38    1.18       39.88   39.33
 5   46    43    1.31       41.17   41.99
 6   50    48    1.40       46.29   47.80

The expected cell counts under the logistic model are
given in Table 2. Although the observed count at the
third dose level appears to be inconsistent with the fitted
model, the discrepancy is not significant.

Morgan (1985) suggested several alternative models
for these data. We fit several models, of which the
complementary log-log link provided the best fit. The
deviance and Pearson statistics for this model are 3.66 and
3.69 with asymptotic p-values 0.455 and 0.449, respectively.
The expected cell counts under this model are
given in Table 2. None of the observations is poorly
fit. In comparison, the complementary log-log provides
a better fit than the logistic at large doses, but a somewhat
poorer fit at the low doses.

Although the sample sizes are large, an exact analysis
of the logistic model is feasible. Our FORTRAN routine
generated the 1496 response vectors belonging to $\mathcal{Y}_{obs}$ in
39 seconds on the IPC. The exact p-values for $D_{obs}$, $X^2_{obs}$,
and $q(y_{obs})$ are 0.383, 0.343, and 0.447, respectively. The
exact p-values for $D_{obs}$ and $X^2_{obs}$ are approximately twice
their unconditional p-values. A conditional assessment
indicates that each observation is adequately fit by the
model.

To illustrate the goodness-of-link check, we evaluated
the exact distribution of $\Delta_A$ using the complementary
log-log as the alternative link function. The difference
between the observed logistic and complementary log-log
deviances was 2.30, which corresponds to the 91.6th
percentile of this distribution. Thus, the data provide
some indication of a departure from the logistic model
in the direction of the complementary log-log. As noted,
the two models give different fits at the extreme doses.
Such differences are an important consideration in model
selection when the extreme percentiles of a tolerance
distribution are the primary interest. The data suggest that
this issue should be explored more completely before
selecting the logistic model for inference.
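
The observed difference in deviances can be reproduced approximately from the Table 2 data with a small two-parameter fit. The sketch below is an illustration only: it uses a generic iteratively reweighted least squares routine (not the authors' FORTRAN code), the function name is hypothetical, and the data are transcribed from Table 2 as printed. It computes only the observed value of the difference, not its exact conditional distribution.

```python
from math import exp, log


def fit_binomial_glm(x, m, y, link, n_iter=50):
    """Fit eta_i = a + b * x_i by iteratively reweighted least squares.

    link is "logit" or "cloglog". Returns (a, b, fitted_means, deviance).
    """
    def inverse(eta):
        if link == "logit":
            return 1.0 / (1.0 + exp(-eta))
        return 1.0 - exp(-exp(eta))

    def forward(prob):
        if link == "logit":
            return log(prob / (1.0 - prob))
        return log(-log(1.0 - prob))

    def slope(eta, prob):
        # Derivative of the success probability with respect to eta.
        if link == "logit":
            return prob * (1.0 - prob)
        return exp(eta) * (1.0 - prob)

    # Start from slightly shrunken observed proportions.
    eta = [forward((yi + 0.5) / (mi + 1.0)) for yi, mi in zip(y, m)]
    a = b = 0.0
    for _ in range(n_iter):
        pr = [inverse(e) for e in eta]
        g = [slope(e, q) for e, q in zip(eta, pr)]
        wts = [mi * gi * gi / (q * (1.0 - q)) for mi, gi, q in zip(m, g, pr)]
        z = [e + (yi / mi - q) / gi for e, yi, mi, q, gi in zip(eta, y, m, pr, g)]
        # Weighted least squares of z on (1, x): solve the 2 x 2 normal equations.
        sw = sum(wts)
        sx = sum(w * xi for w, xi in zip(wts, x))
        sxx = sum(w * xi * xi for w, xi in zip(wts, x))
        sz = sum(w * zi for w, zi in zip(wts, z))
        sxz = sum(w * xi * zi for w, xi, zi in zip(wts, x, z))
        det = sw * sxx - sx * sx
        a, b = (sxx * sz - sx * sxz) / det, (sw * sxz - sx * sz) / det
        eta = [a + b * xi for xi in x]
    mu = [mi * inverse(e) for mi, e in zip(m, eta)]

    def term(u, v):
        return u * log(u / v) if u > 0 else 0.0

    dev = 2.0 * sum(term(yi, ui) + term(mi - yi, mi - ui)
                    for yi, mi, ui in zip(y, m, mu))
    return a, b, mu, dev


# Data transcribed from Table 2.
log_dose = [0.40, 0.71, 1.00, 1.18, 1.31, 1.40]
m = [47, 46, 46, 48, 46, 50]
y = [7, 22, 27, 38, 43, 48]

_, _, mu_logit, dev_logit = fit_binomial_glm(log_dose, m, y, "logit")
_, _, mu_cloglog, dev_cloglog = fit_binomial_glm(log_dose, m, y, "cloglog")
delta_a = dev_logit - dev_cloglog
```

If the table has been transcribed correctly, `dev_logit` and `dev_cloglog` should land close to the deviances reported in this section, and `delta_a` close to the reported difference.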

5  Discussion:  Potential  for 
extending  current  methods 

The model checks we described in this paper are but a
small subset of the methods for which exact analyses are
theoretically possible. For example, specific methods are
needed for non-replicated binary data because of the extreme
discreteness of the response. Landwehr, Pregibon,
and Shoemaker (1984) and Fowlkes (1987) developed diagnostic
tools for binary data. Landwehr et al.'s local
mean deviance plot assesses the local fit of the logistic
model to observations with similar covariate values.
Discussants of this article suggested variations of the local
mean deviance which either group observations with
similar predicted probabilities or use analogs of linear
regression lack-of-fit tests. Regardless of how the data
are grouped, this is a fruitful approach because the deviance
provides no information about the global fit of
the model to non-replicated binary data. Fowlkes's diagnostics
are based on smoothing the binary responses
to examine the underlying structure. In theory, all of the
variations of the local mean deviance plot and Fowlkes's
"smoothed" components can be calibrated conditionally.
Unfortunately, the problems for which
these methods are most effective are beyond the capability
of our current algorithms.

The present infeasibility of enumerating $\mathcal{Y}_{obs}$ for large
data sets is not the only limiting factor with exact methods.
We generated $\mathcal{Y}_{obs}$ for a study with 29 samples of
size two and four binary covariates, only to find that
$\mathcal{Y}_{obs}$ contained over 285 million response vectors. Given the
size of $\mathcal{Y}_{obs}$, certain exact evaluations were infeasible, so
we based our assessments on a sample of responses from
$\mathcal{Y}_{obs}$.

The implementation of conditional methods for model
checking would be greatly enhanced by the development
of an efficient algorithm to simulate from the reference
distribution (2). In the problem discussed just above,
we had to generate $\mathcal{Y}_{obs}$ prior to sampling. Several algorithms
are available for randomly sampling 2 x n contingency
tables with fixed margins, without first generating
the population of tables. Note that the set of 2 x n contingency
tables with fixed margins is equivalent to the
reference set for checking a logistic model with an intercept
term only. The introduction of covariates to the
model imposes additional constraints on the tables. This
added structure makes it difficult to project whether responses
from $\mathcal{Y}_{obs}$ can be randomly generated without
first enumerating this set. The problem merits serious
consideration.

To  close,  we  believe  that  exact  methods  should  play 
an  important  role  in  future  analyses  of  logistic  regression 
models.  We  optimistically  project  that  advances  in  this 
area  will  continue  due  to  an  increased  interest  in  compu¬ 
tationally  intensive  methods  coupled  with  the  continual 
development  of  more  powerful  computing  algorithms  and 
environments. 

Using Gibbs Sampling for Bayesian Inference in
Multidimensional Contingency Tables

Abstract

This  paper  discusses  a  method  suggested  by  Epstein  and 
Fienberg  (1991)  for  the  Bayesian  analysis  of  multidimen¬ 
sional  contingency  tables  in  connection  with  the  Gibbs 
sampler  to  calculate  posterior  densities. 

The  method  consists  of  a  two-stage  hierarchical  prior. 
The first stage is a Dirichlet distribution with a loglinear
reparametrization for its means. The second stage is
a multivariate normal distribution on the loglinear
parameters. However, other distributions can be used if
the Dirichlet-normal combination is not flexible enough
to accommodate one's prior beliefs.

These  prior  distributions  are  useful  when  one  believes, 
with  uncertainty,  in  a  given  loglinear  structure  for  the 
cell  probabilities. 

Key  words:  Contingency  tables;  Bayesian  estimation; 
Dirichlet  prior  distribution;  Gibbs  sampler;  Loglinear 
model;  Maximum  likelihood  estimation  of  Dirichlet  dis¬ 
tributions. 

1  Introduction 

A  new  Bayesian  method  for  the  analysis  of  multidimen¬ 
sional  contingency  tables  was  recently  proposed  by  Ep¬ 
stein  and  Fienberg  (1988)  and  Epstein  (1990).  As  with 
many  other  Bayesian  methods,  ours  uses  the  posterior 
means  of  the  cell  probabilities  to  estimate  these  parame¬ 
ters.  The  focus  on  posterior  means  is  in  part  due  to  the 
importance  of  point  estimation  and  in  part  due  to  com¬ 
putational  difficulties  in  drawing  further  inferences  from 
the  posterior.  The  purpose  of  this  article  is  to  illustrate 
with  an  example  how  to  use  the  Gibbs  sampler  to  com¬ 
pute  estimates  of  the  posterior  densities  that  arise  from 
our method. These density estimates are readily integrable
to compute posterior probabilities and moments.

We introduce the essentials of the method via a simple
example. Suppose our interest is in inferences about the
array of cell probabilities $\theta = \{\theta_{ij}\}$ of a 2 x 2 contingency
table and suppose also that, given $\theta$, the observed counts
$x = \{x_{ij}\}$ follow a multinomial distribution $M(N, \theta)$.
We label the two factors by 1 and 2.

When the data follow a multinomial distribution, it is
common to model prior beliefs with the conjugate
Dirichlet prior $D(K, \eta)$ with density

$$f(\theta \mid K, \eta) = \beta \prod_{i,j} \theta_{ij}^{K\eta_{ij} - 1},$$

where $\beta = \Gamma(K) / \prod_{i,j} \Gamma(K\eta_{ij})$ and $\eta = \{\eta_{ij}\}$.

Before the observation of $x$ we might believe with some
uncertainty that the two factors are independent. That
is, we might believe that $\theta$ satisfies $\theta_{ij} = \theta_{i+}\theta_{+j}$,
$i = 1, 2$, $j = 1, 2$, with some degree of uncertainty.

The condition $\theta_{ij} = \theta_{i+}\theta_{+j}$ is equivalent to

$$\log\theta_{ij} = u + u_{1(ij)} + u_{2(ij)}, \tag{1}$$

with the restriction that the term $u$ in this equation is

$$u = -\log\Big(\sum_{i,j} \exp\big(u_{1(ij)} + u_{2(ij)}\big)\Big), \tag{2}$$

so that $\sum_{i,j}\theta_{ij} = 1$. This normalization leads to an
equivalent parametrization that uses the multivariate
logits, i.e., if

$$\theta_{ij} = \frac{\exp(\gamma_{ij})}{\sum_{i',j'} \exp(\gamma_{i'j'})},$$

then the $\gamma_{ij}$ are the multivariate logits (see Leonard and
Novick, 1976). The parametrization (1) and the normalizing
condition on $u$ are equivalent to reparametrizing
$\{\gamma_{ij}\}$ using

$$\gamma_{ij} = u_{1(ij)} + u_{2(ij)}.$$

Unless necessary, the remainder of this paper omits
explicit reference to the normalizing role of $u$. Thus,
we will simply speak of the loglinear parametrization
$\log\theta_{ij} = u + u_{1(ij)} + u_{2(ij)}$.

To see that the parametrization (1) is equivalent to
independence, substitute the value of $u$ back in equation
(1) to get

$$\theta_{ij} = \frac{e^{u_{1(ij)}}}{\sum_{i'} e^{u_{1(i'j)}}} \cdot \frac{e^{u_{2(ij)}}}{\sum_{j'} e^{u_{2(ij')}}}.$$

Hence

$$\theta_{i+} = \frac{e^{u_{1(ij)}}}{\sum_{i'} e^{u_{1(i'j)}}} \quad\text{and}\quad \theta_{+j} = \frac{e^{u_{2(ij)}}}{\sum_{j'} e^{u_{2(ij')}}}.$$
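The equivalence between the loglinear form (1)-(2) and independence can be checked numerically. The following is a minimal sketch, not part of the original paper; the function name `loglinear_probs` and the u-term values are illustrative assumptions.

```python
import math

def loglinear_probs(u1, u2):
    """Cell probabilities theta_ij = exp(u + u1[i] + u2[j]), with u chosen as in (2)."""
    weights = [[math.exp(a + b) for b in u2] for a in u1]
    total = sum(sum(row) for row in weights)
    u = -math.log(total)  # normalizing term, equation (2)
    return [[math.exp(u) * w for w in row] for row in weights]

# Arbitrary illustrative u-terms (hypothetical values).
theta = loglinear_probs([0.3, -0.3], [1.0, 0.2])
row = [sum(r) for r in theta]                                   # theta_{i+}
col = [sum(theta[i][j] for i in range(2)) for j in range(2)]    # theta_{+j}
```

Under these assumptions each `theta[i][j]` agrees with `row[i] * col[j]` up to rounding, which is the independence condition.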


To incorporate in the prior our uncertain belief in
independence, Epstein (1990) and Epstein and Fienberg
(1991) proposed using a loglinear parametrization on the
Dirichlet means. That is, to reflect the plausibility that
the cell probabilities satisfy (1) they suggest using

$$\log\eta_{ij} = u + u_{1(ij)} + u_{2(ij)}, \tag{3}$$

with $u = -\log\big(\sum_{i,j} \exp(u_{1(ij)} + u_{2(ij)})\big)$.

The index “1” in $u_{1(ij)}$ indicates that this u-term
depends only on the index $i$. It is more common to omit
the indices on which the u-terms do not depend. Thus
often we write $u_{1(i)}$ and $u_{2(j)}$ instead of $u_{1(ij)}$ and $u_{2(ij)}$,
but the fact that $\{u_{1(ij)}\}$ and $\{u_{2(ij)}\}$ are arrays of the
same dimensions as $\{\eta_{ij}\}$ simplifies many formulas.

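To establish the connection with the multidimensional
case, we note that parametrization (3) maps
an array $\{\gamma_{ij}\}$ belonging to the linear subspace
$\{\gamma : \gamma_{ij} = u_{1(i)} + u_{2(j)}\}$ into the array
$\{\exp(\gamma_{ij}) / \sum_{i,j} \exp(\gamma_{ij})\}$.

The parametrization (3) implies that

$$\eta_{i+} = \frac{e^{u_{1(ij)}}}{\sum_{i'} e^{u_{1(i'j)}}} \quad\text{and}\quad \eta_{+j} = \frac{e^{u_{2(ij)}}}{\sum_{j'} e^{u_{2(ij')}}}. \tag{4}$$

Thus, $\{u_{1(ij)}\}$ and $\{u_{2(ij)}\}$ parametrize the marginal
arrays $\{\eta_{i+}\}$ and $\{\eta_{+j}\}$, respectively.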

We follow the notation of Andersen (1974) to represent
marginal tables; the definition is recalled more formally
in section 2. This notation represents the
marginal array with entries $\eta_{i+}$ by $\eta^{Y}$, where
$Y = \{1\}$. The set of factor labels $Y$ indicates that $\eta^{Y}$
depends only on the index corresponding to factor 1,
namely $i$, and that $\eta$ was collapsed over the indices
corresponding to the factors not in $Y$, namely $j$. We will
also use products of arrays. Thus, for example, the product
of $\eta^{\{1\}}$ and $\eta^{\{2\}}$, denoted by $\eta^{\{1\}}\eta^{\{2\}}$, is the array
whose $(i,j)$ entry is $\eta^{\{1\}}_{i}\eta^{\{2\}}_{j}$ or, in the usual notation,
$\eta_{i+}\eta_{+j}$.

The parametrization $\log\eta_{ij} = u + u_{1(ij)} + u_{2(ij)}$ is
equivalent to $\eta_{ij} = \eta_{i+}\eta_{+j}$
(see Albert and Gupta, 1982). However, the loglinear
parametrization on the Dirichlet means allowed Epstein
and Fienberg (1991) and Epstein (1990) to extend the
method to multidimensional tables.

If we feel we cannot specify a value for $\{\eta_{i+}\}$ and $\{\eta_{+j}\}$
or, equivalently, for $u_{1(ij)}$ and $u_{2(ij)}$, then the Dirichlet
distribution cannot adequately represent our prior beliefs.
However, as Albert and Gupta (1982) point out,
the Dirichlet distribution may still be used as the first
stage of a two-stage prior. With a loglinear parametrization
for the Dirichlet means there are two equivalent
alternative ways to complete the two-stage prior. One may
use distributions on the u-terms or one may prefer to
specify distributions on $\{\eta_{i+}\}$ and $\{\eta_{+j}\}$ directly.

The  loglinear  parametrization  will  be  more  useful 
when  analyzing  tables  of  higher  dimensions  where  one 
may  consider  more  complex  loglinear  structures.  As  the 
next  section  explains,  with  loglinear  parametrizations  for 
n-way tables one can also specify the second stage in two
alternative ways, but to use the second one must determine
the generating class of the loglinear parametrization
and use the margins of $\eta$ given by the generator as
parameters of the Dirichlet distribution.

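The parameter $K$ governs the concentration of the
prior distribution about the independence surface

$$S = \Big\{\theta : \theta_{ij} = \theta_{i+}\theta_{+j},\ 0 < \theta_{i+} < 1,\ 0 < \theta_{+j} < 1,\ \sum_{i,j}\theta_{ij} = 1\Big\}.$$

In the limit, as $K \to \infty$, the prior, and therefore the
posterior, concentrate all of their mass on $S$.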

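When we use a two-stage prior we obtain the posterior
means

$$E(\theta_{ij} \mid x) = \frac{N}{N+K}\,\frac{x_{ij}}{N} + \frac{K}{N+K}\,E(\eta_{i+}\eta_{+j} \mid x), \tag{5}$$

which we use to estimate $\theta$. The expectation
$E(\eta_{i+}\eta_{+j} \mid x)$ is taken with respect to the distribution induced
on $\{\eta_{i+}\}$ and $\{\eta_{+j}\}$ through equations (4).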
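As a small worked illustration of equation (5), the sketch below evaluates the weighted average for one cell. The function name `posterior_mean` and the value passed as `e_eta_product` are assumptions for illustration only; in the paper that expectation comes from the posterior, not from a fixed number.

```python
def posterior_mean(x_ij, n_total, k, e_eta_product):
    """Equation (5): weighted average of x_ij / N and E(eta_{i+} eta_{+j} | x)."""
    w = n_total / (n_total + k)          # weight on the observed proportion
    return w * (x_ij / n_total) + (1.0 - w) * e_eta_product

# Hypothetical inputs: a count of 29 out of N = 271 and an assumed expectation 0.135.
small_k = posterior_mean(29, 271, 0.0, 0.135)     # K = 0 gives the observed proportion
large_k = posterior_mean(29, 271, 1e9, 0.135)     # large K approaches the assumed expectation
```

The two calls show the two limits discussed in the text: the weight on the observed proportion is $N/(N+K)$, so it shrinks as $K$ grows.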

In most practical situations, when $K \to 0$ the posterior
means $E(\theta_{ij} \mid x)$ converge to the observed proportions
$x_{ij}/N$. When $K \to \infty$, not only does the posterior
distribution concentrate all of its mass on $S$, but the
posterior mean $E(\theta \mid x)$ itself belongs to $S$. This property
translates into

$$\lim_{K\to\infty} E(\theta_{ij} \mid x) = \lim_{K\to\infty} E(\eta_{i+} \mid x) \times \lim_{K\to\infty} E(\eta_{+j} \mid x).$$

It shows that the estimates corresponding to increasing
values of $K$ reflect an increasingly strong prior belief in
the plausibility of independence of the two factors by
compromising between estimates obtained under a saturated
model and estimates obtained under an independence
model. Epstein (1990) showed that this property
holds for general loglinear parametrizations.

With  this  introductory  example  it  is  now  easy  to  see 
how  our  approach  extends  to  tables  of  higher  dimension. 
If  we  believe,  with  uncertainty,  in  a  given  loglinear  struc¬ 
ture  for  the  cell  probabilities,  we  use  a  two-stage  prior. 
In  the  first  stage  use  a  Dirichlet  distribution  with  means 
having  the  same  loglinear  structure.  In  the  second  stage 
use  distributions,  Gaussian  for  example,  on  the  u-terms 
of  the  loglinear  parametrization. 

In  the  introductory  example  we  speak  of  independence 
being  a  plausible  structure  for  the  cell  probabilities  to 
indicate  that  we  believe  in  independence  only  to  a  certain 
degree.  In  general,  we  will  speak  of  a  plausible  loglinear 
structure  to  indicate  that  we  believe  in  that  structure 
only  to  a  certain  degree. 

In the multidimensional case, as $K \to \infty$, the prior
and therefore the posterior concentrate all their mass
in the subset of arrays $\eta$ defined by the loglinear
parametrization. Epstein (1990) studied properties of
the posterior means as estimators when the loglinear
parametrization on $\eta$ is hierarchical.

The  next  section  reviews  the  extension  of  the  method 
for  multidimensional  tables  and  the  basic  elements  of 
loglinear  parametrizations. 

Section  3  presents  our  implementation  of  the  Gibbs 
sampler.  The  implementation  requires  finding  maximum 
likelihood  estimates  for  Dirichlet  means  under  a  loglinear 
parametrization.  Subsection  3.1  describes  the  use  of  the 
projection  gradient  method  to  compute  these  maximum 
likelihood  estimates.  Additionally,  section  3  discusses 
a  rejection-acceptance  scheme  to  draw  deviates  from  a 
posterior  distribution  that  does  not  require  the  marginal 
(predictive) distribution. Section 4 illustrates the
implementation of the Gibbs sampler and the method of
Epstein and Fienberg (1991) with a simple sociological
example concerning student politics and family structure.

2 A Bayesian Method for Multidimensional Tables

In  this  section  we  review  the  method  proposed  by  Ep¬ 
stein  (1990)  and  Epstein  and  Fienberg  (1991)  for  mul¬ 
tidimensional  tables.  We  refer  the  reader  to  Epstein 
(1990)  for  proofs  and  a  detailed  discussion  of  this  sec¬ 
tion’s  results. 

Following the notation of Andersen (1974), consider
$n$ factors or treatments labeled $1, 2, \ldots, n$, with factor $i$
having $r_i$ levels. Define $\bar{r}_i = \{1, \ldots, r_i\}$ and call it the
set of levels of factor $i$. The set $I = \bar{r}_1 \times \cdots \times \bar{r}_n$ is
usually referred to as the index set or the set of cells.

A selection of levels $i = (i_1, i_2, \ldots, i_n)$, a generic
element in $I$, is often referred to as the $(i_1, i_2, \ldots, i_n)$-cell.
One obtains an $r_1 \times \cdots \times r_n$ contingency table
$x = \{x_i, i \in I\}$ when $N$ individuals are examined and
cross-classified according to the levels of each of the factors.

We shall assume that $x = \{x_i, i \in I\}$ has a multinomial
$M(N, \{\theta_i\})$ distribution, where $\theta_i$ is the probability
of an individual being classified in cell $i$. However, the
method easily adapts to other sampling distributions,
such as Poisson and product multinomial (Bishop, Fienberg,
and Holland, 1975).

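In the first stage use a Dirichlet $D(K, \eta)$ distribution
with density

$$f(\theta \mid K, \eta) = \beta \prod_{i \in I} \theta_i^{K\eta_i - 1}, \tag{6}$$

indexed by $\eta = \{\eta_i, i \in I\}$, and where $\beta = \Gamma(K) / \prod_{i \in I} \Gamma(K\eta_i)$.
The parameter $K > 0$ is prespecified. Thus, $E(\theta_i \mid K, \eta) = \eta_i$
for $i \in I$.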

Let $w \subseteq \bar{n}$, where $\bar{n} = \{1, \ldots, n\}$, i.e., $w$ is a set of
factor labels. We shall denote by $u_w$ the interaction parameter
among the factors in $w$. More specifically, the interaction $u_w$
is the $r_1 \times \cdots \times r_n$ array

$$u_w = \{u_{w(i_1, \ldots, i_n)}\},$$

where the entries $u_{w(i_1, \ldots, i_n)}$ of $u_w$ depend only upon the
indices $i_j$ with $j \in w$. Often the interactions are taken
to satisfy the usual ANOVA constraints, i.e., the sum
of the entries $u_{w(i_1, \ldots, i_n)}$ over the levels of any factor
$j \in w$ is zero. These constraints achieve identifiability
of the parametrization. The Bayesian approach does
not require identifiable parametrizations and therefore
we need not use constraints. Their use, however, is not
precluded. One should use them whenever they facilitate
producing a prior distribution reflecting one's beliefs.

Loglinear parametrizations are usually used for the
multinomial parameters. The model defined by

$$\log\theta = \sum_{w \subseteq \bar{n}} u_w \tag{7}$$

is the saturated or unrestricted model. Whenever a vector,
$x$ say, appears as the argument of a real function of
one variable, $f$ say, then $f(x)$ shall stand for the vector
$(f(x_1), \ldots, f(x_k))'$.

The entries of the array $u_\emptyset$, where $\emptyset$ is the empty set,
are all the same. The term $u_\emptyset$ is usually referred to as
the constant term.
In the general case we can use the multivariate logits
$\gamma_i$ by writing

$$\theta_i = \frac{\exp(\gamma_i)}{\sum_{i' \in I} \exp(\gamma_{i'})}.$$

The parametrization (7) is equivalent to

$$\gamma = \sum_{w \subseteq \bar{n},\, w \neq \emptyset} u_w.$$

We obtain submodels by including only some interactions
in the formula above. To specify which interactions
we include in a submodel, we use a class of subsets of the
set of factor labels $\bar{n}$, which we call $A$. For example, we write

$$\log\theta = \sum_{w \in A} u_w$$

to specify a parametrization which only includes the
interactions among factors in $w$, with $w \in A$.

We are concerned with making inferences when we feel
it is plausible that

$$\log\theta = \sum_{w \in A} u_w, \tag{8}$$

where $A$ is a strict subclass of the subsets of $\bar{n}$. To incorporate this
belief into the prior we suggest that instead of using the
loglinear parametrization on $\theta$ we use it on the Dirichlet
means, that is,

$$\log\eta = \sum_{w \in A} u_w. \tag{9}$$

This restricts $\log\eta$ to lie in a linear subspace $M$ of
the space of $r_1 \times \cdots \times r_n$ arrays. To ensure that the
parametrization is such that $\sum_{i \in I}\eta_i = 1$, it is necessary
to assume that $M$ contains the array $1$ whose entries are all 1.
In the introductory example the class $A$ is $\{\{1\}, \{2\}\}$ and
therefore equation (9) becomes

$$\log\eta_{ij} = u + u_{1(ij)} + u_{2(ij)}.$$

In the parametrization (9) the term $u_\emptyset = \{u\}$ must
satisfy

$$u = -\log\Big(\sum_{i \in I} \exp\Big(\sum_{w \in A,\, w \neq \emptyset} u_{w(i)}\Big)\Big),$$

so that $\sum_{i \in I}\eta_i = 1$. The term $u_\emptyset$ in (8) must satisfy
this restriction as well. The restriction on $u_\emptyset$ will remain
implicit whenever we refer to parametrizations such as
those in (8) and (9).

In summary, we suggest a two-stage hierarchical prior.
The first stage consists of setting $\theta \sim D(K, \eta)$, where $\eta$
is parametrized using (9). The second stage consists of
setting distributions on the u-terms in (9).


As a consequence of using the parametrization (9) one
can specify a value for $\eta$ by specifying values for some
margins of $\eta$. For example, if $\log\eta_{ij} = u + u_{1(ij)} + u_{2(ij)}$,
then $\eta_{ij} = \eta_{i+}\eta_{+j}$. In this fashion we specify the $rc$
values $\eta_{ij}$ by specifying values for $\eta_{i+}$ and $\eta_{+j}$, a total
of only $r + c$ values.

This result extends to the general case. When the loglinear
parametrization (9) is hierarchical then $\eta$ is totally
specified by the value of the margins $\eta^{Y_1}, \ldots, \eta^{Y_t}$,
where $\{Y_1, \ldots, Y_t\}$ is the generating class of the loglinear
parametrization. Therefore, we can implement the second
stage either by using distributions on the u-terms or
by using distributions on the margins $\eta^{Y_1}, \ldots, \eta^{Y_t}$. For
$Y \subseteq \bar{n}$ the $Y$-margin $\eta^{Y}$ is defined as being the array
whose entries are

$$\eta^{Y}_{i} = \sum_{i_j \in \bar{r}_j,\ j \in \bar{n} \setminus Y} \eta_{(i_1, \ldots, i_n)}.$$

3  Implementation  of  the  Gibbs 
Sampler 

This section describes the specifics of the implementation
of the Gibbs sampler for calculating the posterior
densities of the cell probabilities.

We  start  with  a  brief  review  of  the  Gibbs  sampler  and 
refer  the  reader  to  Gelfand  et  al.  (1990)  and  Gelfand 
and  Smith  (1990)  for  a  detailed  description  of  the  use  of 
Gibbs  sampling  in  Bayesian  inference. 

Suppose that one wishes to estimate the density $[X]$
of the random variable $X$ assuming it is possible to draw
deviates from the conditional densities $[X \mid Y]$ and $[Y \mid X]$,
where $Y$ is another random variable.

The algorithm consists of iteratively repeating a two-step
cycle. Before starting one draws a deviate $X^{(0)}$
from an arbitrary density $[X]_0$. Step one of the cycle
is to draw a deviate $Y^{(1)}$ from $[Y \mid X^{(0)}]$. Step
two is to draw $X^{(1)}$ from $[X \mid Y^{(1)}]$. Then one
replaces $X^{(0)}$ by $X^{(1)}$ and proceeds with the second
cycle. A succession of cycles produces a sequence
$(X^{(1)}, Y^{(1)}), (X^{(2)}, Y^{(2)}), \ldots, (X^{(t)}, Y^{(t)}), \ldots$. The
sequence $X^{(t)}$ converges in distribution to $X \sim [X]$ and
$Y^{(t)}$ converges in distribution to $Y \sim [Y]$.
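The two-step cycle can be sketched generically as follows. This is an illustrative skeleton rather than the authors' program; the helper name `gibbs_pair` and the toy target (a standard bivariate normal with correlation `rho`, whose conditionals are normal) are assumptions chosen only because both conditionals are easy to sample.

```python
import random

def gibbs_pair(draw_y_given_x, draw_x_given_y, x0, cycles):
    """Run `cycles` (at least one) two-step Gibbs cycles from x0; return the final (x, y)."""
    x, y = x0, None
    for _ in range(cycles):
        y = draw_y_given_x(x)   # step one: Y from [Y | X]
        x = draw_x_given_y(y)   # step two: X from [X | Y]
    return x, y

# Toy target (assumption): each conditional is N(rho * other, 1 - rho**2).
rho = 0.5
sd = (1.0 - rho ** 2) ** 0.5
rng = random.Random(0)
cond = lambda other: rng.gauss(rho * other, sd)

# m = 500 independent replicates, each produced by t = 20 cycles.
replicates = [gibbs_pair(cond, cond, x0=10.0, cycles=20) for _ in range(500)]
mean_x = sum(x for x, _ in replicates) / len(replicates)
```

Each replicate restarts from the same deliberately poor starting value, so the replicates are independent of one another, which mirrors the use of independent runs in the text.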

Gelfand and Smith (1990) suggest building an estimate
of the density $[X]$ as follows. Using the Gibbs
sampler, obtain $m$ independent replicates $(X_1^{(t)}, Y_1^{(t)}),
\ldots, (X_m^{(t)}, Y_m^{(t)})$. With these deviates obtain the density
estimate

$$\widehat{[X]} = m^{-1} \sum_{j=1}^{m} [X \mid Y_j^{(t)}]. \tag{10}$$
We use Gibbs sampling to estimate the posterior
distribution $[\theta \mid x]$. For the two-stage prior $\theta \sim D(K, \eta)$ and
$\eta \sim [\eta]$, we identify $X$ with $\theta$ and $Y$ with $\eta$ and use
the following Gibbs scheme to draw deviates from the
posterior $[\theta \mid x]$:

    do j = 1, m
        Set $\theta^{(0)}$
        do l = 1, t
            Step 1: draw $\eta$ from $[\eta \mid \theta^{(0)}]$
            Step 2: draw $\theta^{(1)}$ from $[\theta \mid \eta, x]$
            $\theta^{(0)} \leftarrow \theta^{(1)}$
        end do
    end do

On exit, this process has generated $m$ independent
deviates $\eta_j^{(t)} \sim [\eta^{(t)}]$, $j = 1, \ldots, m$, and $m$ independent
deviates $\theta_j^{(t)} \sim [\theta^{(t)}]$, $j = 1, \ldots, m$.

With these deviates, the density estimate in equation
(10) is a finite mixture of Dirichlet densities,

$$\widehat{[\theta \mid x]} = m^{-1} \sum_{j=1}^{m} [\theta \mid \eta_j^{(t)}, x].$$

It is particularly simple to evaluate the marginal density
estimate of a cell probability. For example, in the
situation of the introductory example,

$$\widehat{[\theta_{11} \mid x]} = m^{-1} \sum_{j=1}^{m} [\theta_{11} \mid \eta_j^{(t)}, x], \tag{11}$$

where $[\theta_{11} \mid \eta, x]$ is the beta$(K^*\eta^*_{11}, K^*(1 - \eta^*_{11}))$ density
with $K^* = N + K$, $\eta^*_{11} = \alpha\eta_{11} + (1 - \alpha)x_{11}/N$, and
$\alpha = K/(N + K)$.
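The mixture in equation (11) is straightforward to evaluate once draws of $\eta_{11}$ are available. The sketch below is illustrative only: the function names and the three $\eta_{11}$ values in the usage line are hypothetical, and the beta density is computed on the log scale with the standard-library `math.lgamma`.

```python
import math

def beta_pdf(t, a, b):
    """Beta(a, b) density at 0 < t < 1, computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t))

def theta11_density(t, eta11_draws, x11, n_total, k):
    """Equation (11): average of beta densities over the Gibbs draws of eta_11."""
    k_star = n_total + k                  # K* = N + K
    alpha = k / (n_total + k)             # alpha = K / (N + K)
    total = 0.0
    for eta11 in eta11_draws:
        eta_star = alpha * eta11 + (1.0 - alpha) * x11 / n_total
        total += beta_pdf(t, k_star * eta_star, k_star * (1.0 - eta_star))
    return total / len(eta11_draws)

# Hypothetical draws of eta_11; x11 = 29 and N = 271 are the Table 1 values.
value = theta11_density(0.11, [0.12, 0.13, 0.14], x11=29, n_total=271, k=100.0)
```

Because each component is a proper density, the mixture integrates to one, so posterior probabilities and moments follow from one-dimensional numerical integration.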

Automatically  monitoring  convergence  is  still  an  open 
issue;  at  present  the  best  one  can  do  is  to  prespecify  the 
total  number  of  iterations  t,  say. 

The distribution $[\theta \mid \eta, x]$, which is used in step two
of a Gibbs sampler cycle, is Dirichlet with concentration
parameter $K^* = N + K$ and cell means $\eta^*_i =
K/(N + K)\,\eta_i + 1/(N + K)\,x_i$. Drawing deviates from
the distribution $[\theta \mid \eta, x]$ is straightforward. We chose
to generate these deviates by independently generating
$\gamma_i \sim \mathrm{Gamma}(p_i, 1)$, $i \in I$, with $p_i = K^*\eta^*_i$, and then
setting $\theta_i = \gamma_i / \sum_{i' \in I}\gamma_{i'}$. The joint distribution of $\{\theta_i\}$
is Dirichlet $D(K^*, \{\eta^*_i\})$ with $\eta^*_i = p_i/K^*$.
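The gamma construction for step two can be sketched with the standard library alone. The function names below are illustrative assumptions; `random.gammavariate(shape, scale)` supplies the Gamma$(p_i, 1)$ deviates.

```python
import random

def posterior_params(x, eta, k):
    """Concentration K* = N + K and cell means eta*_i = (K eta_i + x_i) / (N + K)."""
    n_total = sum(x)
    return n_total + k, [(k * e + c) / (n_total + k) for e, c in zip(eta, x)]

def draw_dirichlet(k_star, eta_star, rng):
    """Draw from the Dirichlet with concentration k_star and cell means eta_star."""
    g = [rng.gammavariate(k_star * e, 1.0) for e in eta_star]  # shape p_i, scale 1
    total = sum(g)
    return [v / total for v in g]

# Table 1 counts in the order 11, 12, 21, 22; the eta values are hypothetical.
k_star, eta_star = posterior_params([29, 33, 131, 78], [0.135, 0.094, 0.455, 0.316], 100.0)
theta = draw_dirichlet(k_star, eta_star, random.Random(0))
```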

However, drawing deviates from the distribution
$[\eta \mid \theta^{(0)}] = [\theta^{(0)} \mid \eta]\,[\eta] / [\theta^{(0)}]$, used in step one of a
cycle, is not straightforward.

We suggest the following adaptation of the rejection
method to sample from $[\eta \mid \theta^{(0)}]$. This adaptation uses
deviates $\eta$ from $[\eta]$, which are easy to generate, to obtain
deviates from $[\eta \mid \theta^{(0)}] = [\theta^{(0)} \mid \eta]\,[\eta] / [\theta^{(0)}]$.

    accept <- false
    do while ( not( accept ) )
        generate a deviate $\eta$ from $[\eta]$
        generate a deviate $v$ from $U[0, B]$
        if ( $v < [\theta^{(0)} \mid \eta]$ ) then accept <- true
    end do

Above, $B$ is such that $B \geq [\theta^{(0)} \mid \eta]$ for all $\eta$ in its
domain. It is simple to show that an accepted $\eta$ is
a deviate from $[\eta \mid \theta^{(0)}]$. An important feature of this
approach is that it does not require the calculation of
$[\theta^{(0)}]$ or an estimate of it, as is sometimes necessary
in some implementations of the Gibbs sampler (see for
example Gelfand and Smith, 1990).
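The rejection loop above can be sketched in a few lines. The toy problem is an assumption made so that the bound is known exactly: a scalar parameter with a uniform prior and likelihood $p(1 - p)$, which never exceeds 0.25, stands in for $\eta$ and the Dirichlet likelihood $[\theta^{(0)} \mid \eta]$.

```python
import random

def rejection_draw(draw_prior, likelihood, bound, rng):
    """Return a prior draw, accepted with probability likelihood(draw) / bound."""
    while True:
        candidate = draw_prior(rng)          # deviate from the prior
        v = rng.uniform(0.0, bound)          # deviate from U[0, B]
        if v < likelihood(candidate):
            return candidate

# Toy stand-in (assumption): the accepted draws follow a Beta(2, 2) distribution.
rng = random.Random(1)
draws = [rejection_draw(lambda r: r.random(), lambda p: p * (1.0 - p), 0.25, rng)
         for _ in range(2000)]
mean_p = sum(draws) / len(draws)
```

The normalizing constant of the target never appears, which is the feature the text emphasizes; the price is that the acceptance rate depends on how tight the bound is.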

A generalized rejection method that uses an enveloping
function $B(\eta)$ for $\eta \mapsto [\eta \mid \theta]$ may increase the
speed of this algorithm. At present, however, we will
content ourselves with a boxed envelope, the main advantage
being the ease of programming. Obtaining a good
value for $B$ is crucial for good performance of the rejection
method. The ideal choice is to find $\hat\eta$ such that

$$[\theta^{(0)} \mid \hat\eta] = \max\Big\{[\theta^{(0)} \mid \eta] : \log\eta = \sum_{w \in A} u_w\Big\},$$

and then take $B = [\theta^{(0)} \mid \hat\eta]$. Observe that $\eta \mapsto [\theta^{(0)} \mid \eta]$
is the Dirichlet likelihood function given the data $\theta^{(0)}$.

The next subsection introduces a maximization procedure
to find $B$. The procedure appears to be fast enough
to use it in combination with the Gibbs sampler.

Observe that under the loglinear parametrization
$\gamma = \log\eta$, $\hat\gamma = \log\hat\eta$ is the maximum likelihood
estimate of $\gamma$.

3.1  Maximizing  the  Dirichlet  Likelihood 

In  this  section  we  briefly  describe  the  “gradient  projec¬ 
tion  method”  and  apply  it  to  maximize  the  Dirichlet 
loglikelihood.  In  addition  to  being  easy  to  implement, 
various  features  of  the  Dirichlet  likelihood  and  loglinear 
parametrizations  make  the  gradient  projection  method 
preferable  to  other  methods.  We  discuss  the  advantages 
of  the  gradient  projection  method  after  introducing  ad¬ 
ditional  definitions. 

Recall that a loglinear parametrization for $\eta$ restricts
$\log\eta$ to lie in a linear subspace $M$. The usual form
of writing a loglinear parametrization with u-terms expresses
$\gamma \in M$ in terms of a basis matrix of $M$, i.e., a
matrix $B$ whose columns form a basis for $M$. When expressing
a vector $\gamma \in M$ in terms of the unique $u$ such
that $\gamma = Bu$, the coordinates of $u$ are the u-terms.
To avoid technical complications that the restriction
$\sum_{i \in I}\eta_i = 1$ introduces, we redefine some functions of $\eta$
as functions of $\eta_i$, $i \neq (r_1, \ldots, r_n)$ only. To this effect,
for $\eta$ given we define $\tilde\eta$ as $\tilde\eta_i = \eta_i$, $i \in \tilde{I}$, with
$\tilde{I} = I \setminus \{(r_1, \ldots, r_n)\}$ as the index set for the vectors $\tilde\eta$.

Maximizing the Dirichlet likelihood is equivalent
to maximizing

$$F(\tilde\eta) = K\langle\eta, \lambda\rangle - \sum_{i \in I} \log\Gamma(K\eta_i),$$

where $\lambda = \log\theta$ and $\eta$ is given by $\eta_i = \tilde\eta_i$, $i \in \tilde{I}$, and
$\eta_{(r_1, \ldots, r_n)} = 1 - \sum_{i \in \tilde{I}}\tilde\eta_i$. Except for an additive constant,
$F(\tilde\eta)$ is $\log l(\eta \mid \theta)$.

The Dirichlet means corresponding to the multivariate
logits $\gamma$ are given by $\tilde\eta = H(\gamma)$ with $H_i(\gamma) =
\exp(\gamma_i) / \sum_{i' \in I}\exp(\gamma_{i'})$, $i \in \tilde{I}$. To use the parametrization
with the multivariate logits it is convenient to define

$$G(\gamma) = F(H(\gamma)). \tag{12}$$

Observe that if we use the parametrization with the
u-terms, then we may find $\hat{u}$, the m.l.e. of $u$, by maximizing
$U(u) = G(Bu)$.

Roughly speaking, there are three classes of alternative
methods to maximize $U$. One possibility is to
solve the equation $\nabla U(u) = 0$, where $\nabla U$ stands for
the array of partial derivatives of $U$. Typically, iterative
procedures to solve this equation require updating
an estimate of the Hessian of $U$ after some iterations.
On the one hand, it is difficult to obtain formulas for
the second derivatives of $U$ and on the other, computing
second derivatives numerically is in general expensive
and roundoff errors are difficult to control. Since
$U(u) = G(Bu)$, this approach poses the additional difficulty
of explicitly requiring a basis matrix for $M$.

An alternative is to use a steepest ascent method
where at each step there is a unidimensional search along
the direction $\nabla U(u)$. This alternative also requires a
basis matrix. In fact, any method that uses $u$ as the
variable of the objective function will require a basis
matrix.

The gradient projection method is preferable to these
alternatives because it does not require estimating Hessians
or a basis matrix of $M$. Moreover, the gradient
projection method allows us to take advantage of the
ANOVA-type parametrization for $\gamma$ to perform certain
computations more efficiently.

To use the gradient projection method we view the
problem of maximizing the Dirichlet likelihood as the
problem of finding $\hat\gamma \in M$ such that

$$G(\hat\gamma) = \max\{G(\gamma) : \gamma \in M\},$$

which is a constrained maximization problem. A point
$\gamma \in M$ is referred to as a “feasible point”. The gradient
projection method projects the gradient of the objective
function onto $M$ to increase the value of $G(\gamma)$ and to
maintain feasibility at the same time.

The following is a summary of the gradient projection
method to solve the above maximization problem:

Step 1 Initialization: Choose $\gamma_0 \in M$.
       Let $v_0 = G(\gamma_0)$.

Step 2 Compute $d_0 = \nabla G(\gamma_0)$.

Step 3 Compute $\delta_0 = P_M d_0$.

Step 4 Unidimensional maximization:
       Find $\hat\alpha \geq 0$ such that
       $G(\gamma_0 + \hat\alpha\delta_0) = \max_{\alpha \geq 0} G(\gamma_0 + \alpha\delta_0)$.
       Set $\gamma_1 = \gamma_0 + \hat\alpha\delta_0$.

Step 5 Convergence test:
       Let $v_1 = G(\gamma_1)$.
       If $(v_1 - v_0)/v_0 < \epsilon$ then stop,
       else $\gamma_0 \leftarrow \gamma_1$,
            $v_0 \leftarrow v_1$,
            go to Step 2.


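On exit, $\gamma_1$ is such that $v_1 = G(\gamma_1)$ is an estimate
of the maximum value of $G$. Therefore $l(H(\gamma_1) \mid \theta)$ is an
estimate of the maximum value of $l(\eta \mid \theta)$.

In Step 2, $\nabla G(\gamma_0)$ stands for the array of partial
derivatives $\{\partial G(\gamma_0)/\partial\gamma_i, i \in I\}$. It follows from (12)
that, for $i = (i_1, \ldots, i_n) \in \tilde{I}$ and $c = (r_1, \ldots, r_n)$,

$$\frac{\partial G(\gamma)}{\partial\gamma_i} = K\eta_i\Big[\big(\lambda_i - \lambda_c - \{\psi(K\eta_i) - \psi(K\eta_c)\}\big)(1 - \eta_i) - \sum_{i' \in \tilde{I},\, i' \neq i} \big(\lambda_{i'} - \lambda_c - \{\psi(K\eta_{i'}) - \psi(K\eta_c)\}\big)\eta_{i'}\Big]$$

and

$$\frac{\partial G(\gamma)}{\partial\gamma_c} = -K\eta_c \sum_{i' \in \tilde{I}} \big(\lambda_{i'} - \lambda_c - \{\psi(K\eta_{i'}) - \psi(K\eta_c)\}\big)\eta_{i'},$$

where $\psi$ is the digamma function and $\lambda_i = \log\theta_i$, $i \in I$.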

The formulas to compute the projection $P_M d_0$ in Step
3 are derived in a similar fashion to the formulas to
compute fitted values of the cell means in ANOVA.
However, these formulas are not the same because the
parametrization for $\gamma$ does not involve the constant term
of ANOVA parametrizations.

The existence of $\hat\alpha$ in Step 4 is guaranteed by the
concavity of the Dirichlet likelihood. We used routine e04abf
from the NAg library for the unidimensional maximizations.
Although it would take more programming, perhaps
an algorithm that uses the derivative of $G(\gamma_0 + \alpha\delta_0)$
with respect to $\alpha$ would be more efficient for the
unidimensional maximizations.

It is possible to use other convergence tests in Step 5.
Since our interest here is not in the maximizer $\hat\gamma$, but
in the maximum value $G(\hat\gamma)$, it is appropriate to use
the test in Step 5 to ensure that on exit $\gamma_1$ provides a
function value $v_1$ sufficiently close to $G(\hat\gamma)$.
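For the 2 x 2 independence case, Steps 1 to 5 can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: cells are ordered 11, 12, 21, 22; the projection onto $M$ removes the interaction contrast; a crude backtracking search replaces the NAg routine e04abf; the digamma function is a hand-written approximation; and the convergence test uses $|v_0|$ in the denominator. All function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def softmax(gamma):
    m = max(gamma)
    w = [math.exp(g - m) for g in gamma]
    s = sum(w)
    return [v / s for v in w]

def G(gamma, lam, k):
    """K <eta, lambda> - sum_i log Gamma(K eta_i), with eta = softmax(gamma)."""
    eta = softmax(gamma)
    return sum(k * e * l - math.lgamma(k * e) for e, l in zip(eta, lam))

def grad_G(gamma, lam, k):
    """Gradient of G in gamma, by the chain rule through the softmax map."""
    eta = softmax(gamma)
    a = [k * (l - digamma(k * e)) for e, l in zip(eta, lam)]
    abar = sum(ai * e for ai, e in zip(a, eta))
    return [e * (ai - abar) for ai, e in zip(a, eta)]

def project_additive(d):
    """Project a 2 x 2 array onto arrays of the form a_i + b_j."""
    c = (d[0] - d[1] - d[2] + d[3]) / 4.0     # interaction contrast component
    return [d[0] - c, d[1] + c, d[2] + c, d[3] - c]

def maximize_G(lam, k, eps=1e-10, max_iter=500):
    gamma = [0.0, 0.0, 0.0, 0.0]              # Step 1: feasible starting point
    v0 = G(gamma, lam, k)
    for _ in range(max_iter):
        delta = project_additive(grad_G(gamma, lam, k))   # Steps 2 and 3
        scale = max(abs(d) for d in delta)
        if scale == 0.0:
            break
        step = 1.0 / scale                    # first trial moves gamma by at most 1
        while step > 1e-12:                   # Step 4: backtracking, not an exact search
            trial = [g + step * d for g, d in zip(gamma, delta)]
            v1 = G(trial, lam, k)
            if v1 > v0:
                break
            step *= 0.5
        else:
            break                             # no improving step was found
        converged = (v1 - v0) <= eps * abs(v0)   # Step 5
        gamma, v0 = trial, v1
        if converged:
            break
    return gamma, v0
```

Calling `maximize_G(lam, k)` with `lam` set to the logarithms of a current $\theta^{(0)}$ returns an additive $\gamma$ and a value of $G$ that could serve as the basis for the bound $B$; because the search is inexact, the value is only a sketch of that computation and may underestimate the maximum.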

4  Illustrative  Example 

In this section we reanalyze the 2 x 2 table given in
Table 1, which classifies college students with respect to
their political affiliation and their family structure (from
Braungart, 1971; analyzed in Bishop, Fienberg and
Holland, 1975, pp. 379-380, and by Albert and Gupta,
1984). We use these data to estimate the cell probabilities
using the prior belief that the two variables under
study are plausibly independent. This is the situation
described in the introduction. For illustrative purposes
we use normal distributions on the u-terms in the
parametrization

$$\log\eta_{ij} = u + u_{1(i)} + u_{2(j)}. \tag{13}$$

More precisely, we use

Stage I: $\theta \mid K, \eta \sim D(K, \eta)$, with $\eta_{ij}$ reparametrized
according to equation (13).

Stage II: The $u_{1(i)}$ are independent, $i = 1, 2$. The
$u_{2(j)}$ are independent, $j = 1, 2$, and also independent
of the $u_{1(i)}$, $i = 1, 2$. The distribution of $u_{1(i)}$
is $N(\mu_{1(i)}, \sigma^2_{1(i)})$ and the distribution of $u_{2(j)}$
is $N(\mu_{2(j)}, \sigma^2_{2(j)})$.

To use this prior density one first specifies the parameter
vectors $\mu_1 = (\mu_{1(1)}, \mu_{1(2)})$ and $\sigma_1 = (\sigma_{1(1)}, \sigma_{1(2)})$,
reflecting the user's prior knowledge about the proportion
of students in the two political affiliations, and
$\mu_2 = (\mu_{2(1)}, \mu_{2(2)})$ and $\sigma_2 = (\sigma_{2(1)}, \sigma_{2(2)})$, reflecting
the user's prior knowledge about the proportion of students
in the two family structures.

In this example we set $\mu_1 = (.5; .5)$, $\sigma_1 = (2.0; 2.0)$
and $\mu_2 = (.5; .5)$, $\sigma_2 = (2.0; 2.0)$, reflecting a rather
imprecise belief about the u-terms. Second, one specifies
a value for the parameter $K$.

Table 1: Parental decision making and political affiliation.
Source: Braungart (1971).

                                 Political Affiliation
Parental Decision Making           SDS        YAF
Authoritarian                       29         33
Democratic                         131         78

Albert and Gupta (1982) and Epstein and Fienberg
(1991) computed the posterior means (5) for this table
but they used different distributions to reflect uncertain
prior beliefs about independence. In both articles the
posterior expectations of the $\eta$'s were estimated using a
Monte Carlo method.

Table 2 reports the computed values for the posterior
means of each of the cell probabilities for several
values of $K$ (the column headed by $K = \infty$ actually corresponds
to a very large, but finite, value of $K$). The
estimates corresponding to finite values of $K$ reflect the
uncertain prior belief in independence by compromising
between estimates obtained under a saturated model and
estimates obtained under an independence model.
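
The compromise can be made concrete with a small calculation.
The sketch below is an illustration, not the computation behind
Table 2. It assumes the standard Dirichlet conjugacy result
that, for a fixed prior mean eta, the posterior mean of cell
(i, j) is (n_ij + K * eta_ij) / (N + K), and it fixes eta at the
independence fit to the Table 1 counts, whereas the article
treats eta as random. The function names are hypothetical.

```python
# Counts from Table 1 (rows: Authoritarian, Democratic; columns: SDS, YAF).
COUNTS = [[29, 33], [131, 78]]


def independence_fit(counts):
    """Cell probabilities under independence: products of the margins."""
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(col) / total for col in zip(*counts)]
    return [[r * c for c in col_p] for r in row_p]


def shrunken_means(counts, k, eta):
    """Posterior means (n_ij + K*eta_ij) / (N + K) for a fixed eta (assumed form)."""
    total = sum(sum(row) for row in counts)
    return [[(n + k * e) / (total + k) for n, e in zip(n_row, e_row)]
            for n_row, e_row in zip(counts, eta)]


if __name__ == "__main__":
    eta = independence_fit(COUNTS)
    for k in (0, 100, 1000, 1e9):
        means = shrunken_means(COUNTS, k, eta)
        print(k, [round(p, 3) for row in means for p in row])
```

With K = 0 the function returns the saturated estimates
n_ij / N, and as K grows the estimates move toward the
independence fit.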

Figure 1 reports estimates of the marginal posterior
densities for each of the cell probabilities. These
estimates were obtained using formula (11) for the posterior
density of $\theta_{11}$, with the obvious modifications
for the other cell probabilities. We used m = 20 independent
replicates and each of the replicates was generated
with t = 20 cycles of the Gibbs sampler. In addition we
computed these density estimates using different values
of m and t. On a plot the resulting estimates appeared
to be fairly similar for values of t and m as low as 10.
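
As a rough illustration of how such a density estimate can be
assembled, the sketch below averages beta densities over the
replicates. Formula (11) is not reproduced in this section, so
the form used here is an assumption: given a replicate value
eta_11, the conditional posterior of theta_11 is taken to be
Beta(n_11 + K * eta_11, N - n_11 + K * (1 - eta_11)), the marginal
of a conjugate Dirichlet update. The replicate values in the
usage lines are made-up placeholders, not output of the
sampler described in the article, and the function names are
hypothetical.

```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def mixture_density(x, eta_draws, n_cell, n_total, k):
    """Average of the conditional beta densities over the replicate draws of eta."""
    dens = [beta_pdf(x, n_cell + k * eta, n_total - n_cell + k * (1 - eta))
            for eta in eta_draws]
    return sum(dens) / len(dens)


if __name__ == "__main__":
    # Placeholder replicate values of eta_11 (hypothetical, for illustration only).
    eta_draws = [0.12, 0.13, 0.14, 0.135, 0.128]
    grid = [i / 200 for i in range(1, 200)]
    curve = [mixture_density(x, eta_draws, n_cell=29, n_total=271, k=100)
             for x in grid]
    print(max(zip(curve, grid)))  # (peak height, location of the peak)
```

Because each component is a proper density, the average is
one as well, whatever the number of replicates m.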


Table 2: Computed values of posterior means for different
values of K

K               [...]    100    200    400    600   1000   2000      ∞
$\theta_{11}$    .107   .115   .119   .125   .130   .126   .135   .133
$\theta_{12}$    .122   .115   .110   .105   .102   .098   .100   .093
$\theta_{21}$    .483   .474   .471   .469   .468   .463   .453   .459
$\theta_{22}$    .288   .296   .300   .302   .299   .313   .312   .316


5  Discussion 

This  article  reports  on  an  implementation  of  the  Gibbs 
sampler  to  estimate  the  full  posterior  density  of  the  array 
of  cell  probabilities  of  n-way  contingency  tables  using 
the  method  proposed  by  Epstein  (1990)  and  Epstein  and 
Fienberg  (1991).  One  easily  obtains  estimates  of  the 
posterior  distributions  of  the  individual  cell  probabilities 
as  a  finite  mixture  of  beta  densities. 

Gelfand and Smith (1990) proposed the Gibbs sampler
as an easy-to-implement algorithm to generate deviates
from posterior distributions. An expeditious implementation
requires that all necessary distributions be available
for sampling. This was not the case in this article
and we expended some effort to generate deviates from
$[\eta \mid \theta]$.

To sample from $[\eta \mid \theta]$ we used a scheme that does not
require the marginal density $[\theta]$, which is often the main
obstacle to computing $[\eta \mid \theta]$. The scheme uses the facts
that $[\eta]$ is available for sampling and that $[\theta \mid \eta]$, as a
function of $\eta$, can be viewed as a concave likelihood function
with a unique maximum. This maximum provides the
height of a box for a rejection sampling method. The
gradient projection method proved to be fast and very
easy to program. We are currently investigating its use
in maximum likelihood estimation for generalized linear
models and will report on this work elsewhere.
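
The rejection step can be sketched as follows. This is an
illustrative one-parameter analogue, not the implementation
used in the article: theta given eta is taken to be
Beta(K * eta, K * (1 - eta)), the prior [eta] is an arbitrary
Beta(2, 2), and a ternary search stands in for the gradient
projection method when locating the maximum. All function
names are hypothetical.

```python
import math
import random


def log_density_theta_given_eta(theta, k, eta):
    """Log of the Beta(k*eta, k*(1-eta)) density at theta; concave in eta."""
    a, b = k * eta, k * (1.0 - eta)
    return (math.lgamma(k) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta))


def argmax_concave(f, lo, hi, iterations=100):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def sample_eta_given_theta(theta, k, draw_prior, rng, n_draws):
    """Draw from [eta | theta] by rejection, proposing from the prior [eta]."""
    def log_f(eta):
        return log_density_theta_given_eta(theta, k, eta)

    # The maximum of [theta | eta] over eta is the height of the rejection box.
    log_height = log_f(argmax_concave(log_f, 1e-9, 1.0 - 1e-9))
    draws = []
    while len(draws) < n_draws:
        eta = draw_prior(rng)
        if not 0.0 < eta < 1.0:
            continue                              # guard against boundary values
        log_u = math.log(1.0 - rng.random())      # log of a uniform on (0, 1]
        if log_u <= log_f(eta) - log_height:      # accept with prob. f / max f
            draws.append(eta)
    return draws


if __name__ == "__main__":
    rng = random.Random(1)
    draws = sample_eta_given_theta(
        0.3, 100.0, lambda r: r.betavariate(2.0, 2.0), rng, 1000)
    print(sum(draws) / len(draws))
```

The marginal density of theta never appears: only prior draws
and evaluations of [theta | eta] are needed, which is the
property the scheme relies on.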

Our scheme to sample from $[\eta \mid \theta]$ can be used to implement
the Gibbs sampler for a variety of other problems
involving two-stage priors where the first stage is
the conjugate prior for the sampling distribution and the
second stage distribution is available for sampling.

Furthermore,  we  feel  that  the  simplicity  of  the  Gibbs 
sampler  warrants  exploring  new  algorithms  to  generate 
deviates  from  distributions  that  thus  far  have  not  been 
available  for  sampling.  For  clarity  we  used  a  simple  2x2 
example  to  illustrate  our  implementation. 

In higher dimensional tables, it makes special sense to
utilize the structure of $\eta$ in terms of its marginals as part
of the algorithm and to set up a cycle involving steps for
the conditional densities for each of the marginals of $\eta$
instead of a single step for $[\eta \mid \theta]$. We hope to report on
the details of such an algorithm at a future date.

Abstract

Scientific  video,  combining  animated  images  with  sound, 
is  a  powerful  tool  for  understanding  transient  two- 
dimensional  or  static  three-dimensional  data.  Anima¬ 
tion  of  colored  perspective  plots  often  reveals  subtle  data 
characteristics  not  seen  in  a  series  of  static  images.  Us¬ 
ing  the  multidimensional  data  fitting  technique  loess,  it 
is  possible  to  construct  dashed  plots  that  visually  rep¬ 
resent  the  smoothed  approximation  and  its  local  error. 
Sound  can  be  used  to  add  scalar  parameters  like  time 
“tick  marks”  or  the  amplitude  of  an  associated  quantity; 
we  describe  means  of  processing  such  scalar  data  using 
variation-diminishing  splines  and  specifics  of  sound  gen¬ 
eration  including  loudness  equilibration.  We  also  explain 
the  limitations  of  our  techniques  and  suggest  some  exten¬ 
sions. 

1  Introduction 

Our  long-term  research  interests  have  been  in  simulation 
techniques  and  data  fitting.  Measured  or  simulated  data 
often  occur  as  values  sampled  at  scattered  locations  in 
time  and  two  or  three  spatial  variables.  Moreover,  there 
are often one or more scalar parameters associated with
such  data  sets,  representing  time,  a  global  error  value, 
an  integral,  and  so  forth.  Such  complex  data  fields  are 
difficult  to  comprehend,  but  we  have  found  that  scientific 
video,  combining  animated  images  with  sound,  is  of  great 
help.  We  will  describe  some  of  the  techniques  we  use 
to  generate  images  and  sound,  trying  to  emphasize  those 
that  have  not  yet  become  common. 

Although graphics hardware has become increasingly
powerful, renderings of complex scenes with proper shading
and lighting are still difficult to generate in real time
for moderate cost. As a result, we employ interactive
techniques where they are essential for data analysis, such
as scatterplot brushing [2] and selecting a viewing perspective
or position of a light source. Our model has
been to generate high-quality rendered images that are
then recorded a frame at a time on NTSC videotape; by
high quality, we mean anti-aliased perspective images including
texture maps, lighting models, mild reflectivity,
and transparency. Once an animated sequence has been
generated on videotape, we then construct a synchronized
“sound track”.

We  have  presented  some  of  our  basic  image  tools  else¬ 
where  [5].  Our  most  important  tool  is  the  equi-spaced 
color-level  plot  with  either  orthographic  or  perspective 
projections.  Here,  we  will  describe  a  dashed  surface 
plot  that  can  simultaneously  convey  the  shape  of  a  two- 
dimensional  function  and  its  local  error. 

We  also  presented  our  basic  sound  tools  in  [5].  We 
have  used  sound  in  several  forms.  Sound  has  been  most 
useful  to  underscore  the  passage  of  time  in  the  form  of 
beats.  We  have  also  found  it  useful  to  vary  the  pitch, 
volume,  and  tempo  in  order  to  represent  other  scalar 
quantities.  Here,  we  will  describe  means  based  largely 
on  variation-diminishing  splines  for  stretching,  smooth¬ 
ing,  or  compressing  data  to  fit  into  a  prescribed  video 
segment.  In  addition,  we  explain  how  “loudness  equili¬ 
bration”  can  help  make  listener  perception  more  uniform. 

Our  view  is  that  both  the  monolithic  system  and  the 
subroutine  model  of  software  communication  are  inappro¬ 
priate  for  this  application.  It  is  often  the  case  that  data 
must  be  transmitted  between  machines  or  highly  special¬ 
ized  programs.  Hence,  it  is  attractive  to  employ  standard¬ 
ized,  self-descriptive,  ASCII  file  formats  and  use  files  or 
UNIX  pipes  to  transmit  data  between  disjoint  processes. 
All  of  our  tools  use  a  uniform  interface  for  exchanging 
data  [4],  based  on  the  AWK  paradigm  [1].  We  employ  the 
RenderMan  Interface  Bytestream  [12]  (rib)  to  decouple 
the modeling and rendering tasks while preserving reasonable
generality in possible graphical techniques.

The  next  section  (§  2)  presents  our  schemes  for  gener¬ 
ating  images.  In  §  3,  our  techniques  for  generating  and 
manipulating  sounds  are  discussed. 

2  Colored and transparent surfaces for fields

The basic tool for understanding two-dimensional images
is the color-level plot. This is the natural generalization
of line-based contour plots adapted to modern raster
graphics hardware that supports texture mapping.
Orthographic projection results in color-level plots that
are close to traditional contour plots (using gray-scale
instead of color brings one even closer). The eye does not
respond uniformly in wavelength to light. Hence, it is
important to choose colors that appear to be equi-spaced.
This process can be reduced to a nonlinear least squares
problem in an appropriate psychophysical metric [5].

A flat orthographic color-level plot is not the most
intuitive representation for a two-dimensional data field,
though in trained hands it is often the most informative.
We have found that a perspective projection helps
a great deal, particularly when the surface is given specular
highlights. The choice of perspective is often best
made with an interactive tool; we have described a
“helicopter” model elsewhere [5]. Shading provides cues about
inflections and other subtle phenomena.

Often a sequence of two-dimensional data fields is provided.
The successive images may represent data at different
times or as a function of another parameter. We
employ frame-at-a-time animation to generate a video
segment representing the data, where each frame is typically
a color-level plot.

Our target medium is VHS videotape since, for now,
that is the only universally presentable format. We have
attended many meetings where the speaker complained
that some critical feature was impossible to see in the
displayed video but was quite clear in his lab. So we try
to use all available techniques (such as anti-aliasing, color
desaturation, and motion) to make legible at least the
most important features. Forcing ourselves to stay with
NTSC resolutions has also helped us avoid the tendency to
add distracting dials and other dynamic icons to already
complicated images.

Loess is a general mechanism for computing a smoothed
approximation to scattered data, which produces local
standard error estimates [3]. One new approach to viewing
a loess surface and the local errors is the dashed surface.
In two dimensions, the dashed surface is constructed
by dividing the domain into squares and then trimming
the color-level plot in each square in proportion to the
error. That is, look at a local square patch of the surface.
The colored region of the square is retracted from
the edges as the error increases. This trimming of the
plot in the square must be done so that the patch area
(not the perimeter) is inversely proportional to the error.
This approach results in a plot that “breaks up” where
the smoothed representation is not good; it is the natural
generalization of a one-dimensional plot that becomes
broken up into smaller and smaller dashes as the error
increases. We reserve the use of error bars projected up
from the surface for displaying actual residuals. See
Figure 1.

3  Sound for scalar parameters

The use of sound is still rather speculative. In fact, many
visually oriented people may question why it should be
preferred over a more complex image. We would like to
argue the case for sound to represent scalar parameters
associated with an animated sequence of data fields.
Examples of such scalar parameters are time, global error
values, and functions computed from the data fields.

The  simplest  and  most  effective  application  of  sound  is 
to  denote  the  passage  of  time  in  a  simulation.  Drumbeats 
are  the  natural  sound  to  associate  with  a  discrete  moment 
in  time  and  are  analogous  to  the  tick  marks  found  on  a 
scatterplot.  (The  same  software  that  we  use  in  our  graph¬ 
ics  software  to  pick  “round”  numbers  for  tick  marks  was 
immediately  applicable  to  generating  such  time  beats.) 
We  found  earlier  that  beats  attached  to  a  particular  sim¬ 
ulation  hesitated  at  points  where  time  steps  had  to  be 
repeated.  We  had  overlooked  the  repetitions  of  time 
in  a  laserprinter  plot  of  the  time  progression;  the  beats 
stretched  out  the  time  points  making  it  possible  to  hear 
things  we  had  not  seen  earlier. 
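
The idea of reusing tick-mark logic for time beats can be
sketched as follows. The tick-mark software referred to above
is not described here, so this sketch swaps in the common
1-2-5 rule for choosing round steps; the function names and
the frame-matching rule (a beat sounds on the first frame
whose simulation time reaches it) are assumptions made for
illustration.

```python
import math


def nice_step(span, target_beats=5):
    """Pick a step of the form 1, 2 or 5 times a power of ten."""
    raw = span / target_beats
    magnitude = 10.0 ** math.floor(math.log10(raw))
    for mult in (1, 2, 5):
        if raw <= mult * magnitude:
            return mult * magnitude
    return 10 * magnitude


def beat_times(t_start, t_end, target_beats=5):
    """Round-number simulation times in [t_start, t_end] at which to sound a beat."""
    step = nice_step(t_end - t_start, target_beats)
    first = math.ceil(t_start / step - 1e-9)
    last = math.floor(t_end / step + 1e-9)
    return [i * step for i in range(first, last + 1)]


def beat_frames(frame_times, beats):
    """Index of the first video frame whose simulation time reaches each beat."""
    frames = []
    for beat in beats:
        for index, t in enumerate(frame_times):
            if t >= beat - 1e-12:
                frames.append(index)
                break
    return frames


if __name__ == "__main__":
    # Simulation time per frame; the step at t = 2 is repeated twice.
    frame_times = [0, 1, 2, 2, 2, 3, 4, 5, 6]
    print(beat_frames(frame_times, [0, 2, 4, 6]))
```

In the usage lines the repeated time step widens the gap, in
frames, between two successive beats, which is the kind of
hesitation described above.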

We  have  also  considered  the  use  of  pitch,  volume,  and 
tempo  to  represent  more  complicated  scalar  functions  or 
a  combination  of  functions  [5].  In  one  example,  we  had 
a  scalar  function  of  another  scalar  parameter.  We  varied 
the  pitch  to  represent  the  “independent”  scalar  while  in¬ 
creasing  the  volume  and  repetition  rate  in  proportion  to 
the  “dependent”  variable. 

Our  experience  with  auditory  representations  of  more 
complex  scalars  is  that  they  cannot  replace  line  draw¬ 
ings.  We  have  seen  a  number  of  animations  where  time 
(or  another  scalar)  is  represented  by  an  analog  clock  or  a 
number  displayed  on  the  periphery  of  the  basic  data  field 
display,  such  as  our  color-level  plots.  (Variable  length 
bars  are  a  more  workable  alternative  if  there  are  only  one 
or  two  bars.)  Introducing  extraneous  visual  information 
is  often  distracting  —  the  eye  must  glance  from  the  main 
display  to  the  indicator  and  important  information  may 
be  missed.  If  you  examine  a  line  drawing  of  the  scalar  pa¬ 
rameters  carefully,  then  the  sound  representation  gives  a 
qualitative impression of the scalar value without the annoying
distraction. The other problem with adding scalar
values as clocks or bars in the image is the limitation of
NTSC resolution; we have found that the basic images
are usually complex enough without trying to add other
small auxiliary features. (Obviously, this latter comment
doesn’t apply to high resolution displays but the distraction
comment does.)

Figure 1: Qualitative display of standard errors by using a dashed surface.

Sound suffers from some of the same limitations as
color; it has limited resolution, is not as familiar as
conventional line drawings, and has some hard psychophysical
problems. In addition, sound is only applicable
to animated sequences since it’s inherently transient.
Finally, human beings can assimilate much more information
visually than via hearing, so sound must be employed
in limited ways.

Let us consider some of the parameters governing auditory
perceptions and how we use them:

•  pitch: If we constrain ourselves to the Western scale,
we can take half-steps over three to five octaves. Although
the use of major scales is more pleasing to
the ear, it severely reduces the number of available
notes and, hence, the resolution. Many (synthesized)
instruments are incapable of generating notes over a
wide frequency domain. We have considered trills to
represent error bars but so far have found the technique
to be of limited value.

•  tempo: By this, we mean the duration and frequency
of striking notes. Humans are surprisingly sensitive
to variations in tempo.

•  volume: It is possible to vary the volume over a number
of relatively fine steps but perceptions are coarse.
This problem is made worse by the fact that a particular
fixed amplitude will be heard to have a different
“loudness” as the pitch is varied.

•  voice: We use this term to mean instrument. It is
possible to distinguish and follow the notes generated
by several instruments. We sometimes supplement
this using stereo and reverberation to emphasize the
separation of voices.

•  melody: We have considered transitional notes and
chording patterns to denote changes in scalar values,
but have not been completely satisfied with the results
at present. See [10] for a discussion of melody
versus chord.

This variety of knobs would in principle allow the simultaneous
presentation of many scalar variables. We have had
better luck by instead presenting only one or two variables
and using them to control several musical parameters; the
redundancy overcomes some of the psychophysical
difficulties.

We have built a number of tools to generate sounds
based on our tensor/scatter file format [4]. The tools
can generate percussion to denote time, percussion to act
as a counter of discrete events (heard in the associated
videotape to count the number of Newton iterations), and
a variety of more complex sounds with variable pitch,
volume, and tempo. All of the tools generate standard
MIDI [8] files to drive our synthesizer equipment [9]; MIDI
allows the specification of events in time that start (or
stop) a note with a particular velocity (roughly volume).

We have found that our generic approximation tools
also play a role in sound generation. Often we will have a
fixed sequence of scalar values that are uniformly spaced.
Such a sequence may have to be translated into more
than one similar sound sequence lasting different lengths
of real time. Variation-diminishing splines can be used
to expand, compress, or resample the sequence in a manner
suited to the sound segments of different lengths; for
data with dramatic excursions (outliers), such
variation-diminishing techniques also smooth the data so the sound
is more palatable and, more importantly, easier to grasp
(given the limitations of the ear). For non-uniform data,
least-squares spline fits or loess can be employed.

We have experienced difficulties with loudness perception.
As mentioned earlier, the human ear perceives notes
of varying pitch but equal amplitude as having different
loudness values. Figure 2 is based on measured data [11,
p. 45] and clearly shows that low and high frequency response
is not flat. MIDI defines middle C as note 60, so we
can map note $n$ onto frequency by

$$f = 2^{(n-60)/12} \cdot 523.25.$$

We can then build a loudness compensation function by
using the above formula and logarithmic interpolation of
the measured data from Figure 2. We have done this and
find that it does improve volume perception; this permits
us to use a wider range of frequencies and, hence,
increase the resolution available with sound. The technique
should be extended to include a second variable
varying the amplitude (other phon values). Other possibilities
would be to incorporate alternative loudness measures
like L = [...], where L is loudness and I is intensity,
and to include temporal integration [11].

Figure 2: Plot of perceived dB level (60 phons) as frequency
is varied.

Another difficulty is that MIDI only lets us directly control
the “velocity” and that is only loosely related to loudness
for some instruments. Among the voices we use,
piano and the xylophone seem about the most linear.

Sound is limited to providing a small number of cues
for scalar parameters. We have found it possible to mark
time with drumbeats and then use different instruments
with variable pitch and so forth to track two or three
scalars. Using more voices for more parameters seems
ineffective.

There is a growing literature on the subjects discussed
here. For example, other related papers in the proceedings
that contain [5] are [6] and [7].

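
The resampling and loudness-compensation steps of this
section can be sketched together. Several choices below are
assumptions made for illustration and are not taken from the
tools described: the data values are used directly as
coefficients of a clamped uniform B-spline (which is variation
diminishing but shifts timing slightly near the ends), the
equal-loudness curve is supplied by the caller as (frequency,
dB) pairs because the measured data of Figure 2 are not
reproduced here, and the curve is interpreted as the level
required for equal loudness. The constants 60 and 523.25 are
the ones quoted in the text; under the widespread convention
that note 69 is 440 Hz, note 60 would be near 261.63 Hz, so
they are left as parameters. All function names are
hypothetical.

```python
import math


def vd_spline_resample(values, n_out, degree=2):
    """Resample `values` to n_out points with a clamped uniform B-spline.

    The data values serve as the B-spline coefficients, so the result is
    variation diminishing and stays within the range of the data.
    """
    n = len(values)
    if n <= degree or n_out < 2:
        raise ValueError("need len(values) > degree and n_out >= 2")
    spans = n - degree                                # number of knot intervals
    knots = ([0.0] * degree
             + [i / spans for i in range(spans + 1)]
             + [1.0] * degree)
    out = []
    for j in range(n_out):
        x = j / (n_out - 1)
        k = degree + min(int(x * spans), spans - 1)   # knots[k] <= x <= knots[k+1]
        d = [values[i + k - degree] for i in range(degree + 1)]
        for r in range(1, degree + 1):                # de Boor recursion
            for i in range(degree, r - 1, -1):
                left = knots[i + k - degree]
                right = knots[i + 1 + k - r]
                alpha = (x - left) / (right - left)
                d[i] = (1.0 - alpha) * d[i - 1] + alpha * d[i]
        out.append(d[degree])
    return out


def note_frequency(note, ref_note=60, ref_hz=523.25):
    """Frequency of a MIDI note, using the constants quoted in the text."""
    return ref_hz * 2.0 ** ((note - ref_note) / 12.0)


def level_db(freq_hz, curve):
    """Interpolate (frequency_hz, level_db) pairs linearly in log-frequency."""
    pts = sorted(curve)
    if freq_hz <= pts[0][0]:
        return pts[0][1]
    for (f0, d0), (f1, d1) in zip(pts, pts[1:]):
        if freq_hz <= f1:
            w = (math.log(freq_hz) - math.log(f0)) / (math.log(f1) - math.log(f0))
            return d0 + w * (d1 - d0)
    return pts[-1][1]


def compensation_db(note, curve, ref_note=60):
    """Gain in dB, relative to the reference note, for equal perceived loudness."""
    return (level_db(note_frequency(note), curve)
            - level_db(note_frequency(ref_note), curve))
```

Calling vd_spline_resample with different values of n_out
turns one scalar sequence into sound segments of different
lengths, and compensation_db gives a per-note correction that
still has to be mapped onto MIDI velocity for the instrument
in use.
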
Abstract

StatLog  is  a  European  Community  (ESPRIT)  funded 
project  which  began  in  October  1990.  There  are  about 
10  academic  and  industrial  partners  involved  in  the 
project.  Its  aim  is  to  complete  an  evaluation  of  the 
performance  of  Machine  Learning  and  Statistical  Algo¬ 
rithms  on  large-scale,  complex  commercial  and  industrial 
problems.  The  objectives  of  the  project  are  threefold: 

1.  to  provide  critical  performance  measurements,  and 
criteria  for  measurement  on  available  Learning  Al¬ 
gorithms  which  improve  confidence  in  full  exploita¬ 
tion; 

2.  to  indicate  the  nature  and  scope  of  next-stage  devel¬ 
opment  which  particular  algorithms  require  to  meet 
commercial  performance  expectations; 

3.  to  indicate  the  most  promising  avenues  of  develop¬ 
ment  for  the  commercially  immature  approaches. 

This  paper  describes  the  project  and  the  progress  com¬ 
pleted  to  date. 

1  Introduction 

In common with other ESPRIT projects, a consortium
of academic and industrial partners work together, with
different roles, towards a common goal. The main goal
of this project is the comparative testing of statistical
and logical learning algorithms on large-scale applications
in classification, forecasting, control and unsupervised
learning. Other members of the consortium include
the Universities of Strathclyde (U.K.), Granada
(Spain), Porto (Portugal), and Lübeck (Germany), and
industrial partners Daimler-Benz (Germany), Turing
Institute (U.K.), Brainware (Germany), ISoft (France),
Messerschmitt-Bölkow-Blohm (Germany) and the Institute
of Automation (Germany).

Most of the algorithms in this project are applicable
to problems in classification, and the main task is to
run a large experiment involving a balanced design to
measure the performance of each algorithm × data set
combination. In classification and forecasting problems it is
fairly clear how to measure performance, whereas in control
and unsupervised learning these issues are still to be finalised.
In this article we give some examples of the classification
algorithms to be used in this project.

In addition there are a number of algorithms which
deal with “unsupervised learning”, i.e. methods which
look for structure in the data. Examples are ITRULE
[19], L-Induction [3] and some standard statistical methods
such as principal components and projection pursuit.
We mention this area of work in the section on different
types of data. Finally, we discuss the procedure to obtain
objective performance measures, and give the work
schedule of the project.

2  Classification  Algorithms 

Due to time and resource constraints, the project will
use, wherever possible, “off the shelf” packages. The
methods to be considered can be grouped under a number
of headings:

2.1  Neural  Networks 

Back-propagation  is  designed  to  overcome  the  limita¬ 
tions  of  the  perceptron  [18].  The  architecture  is  com¬ 
posed  of  an  input  layer,  an  output  layer  and  a  set 
of  internal  “hidden  units”.  We  also  consider  a  faster 
and  more  efficient  variant  known  as  Quadratic  back- 
propagation. 

Counter-propagation reduces the training time for
back-propagation by making use of Kohonen’s [11]
self-organising algorithm and Grossberg’s Outstar [6]
algorithm.

2.2  “Classical  Statistics” 

These  are  all  standard  methods: 

Discriminant  analysis 

Logistic  regression 

Multivariate  analysis  of  variance 

for  which  we  will  use  routines  from  SAS,  Splus  and  SPSS. 

2.3  “Modern  Statistics” 

ALLOC80  [7]  is  a  package  which  implements  kernel 
density  estimation  methods  using  real,  integer  or  nomi¬ 
nal  data. 

Polytree  algorithm.  [16]  Belief  networks  are  directed 
acyclic  graphs  in  which  the  nodes  represent  propositions 
or  variables,  the  arcs  signify  direct  dependencies  between 
the  linked  propositions  and  the  strengths  of  these  depen¬ 
dencies  are  quantified  by  conditional  probabilities.  The 
polytree algorithm is used to recover the graph representing
a probability distribution from a set of examples.

SMART [5] is a collection of Fortran subroutines which
perform Projection Pursuit classification and Projection
Pursuit regression.

2.4  Bayesian  Statistics 

A  naive  Bayes  classifier  which  takes  an  encoded  data 
set  and  builds  a  Bayesian  Classifier.  Real  valued  at¬ 
tributes  are  either  given  a  normal  model  or  a  cut-point 
model. 

Helen  uses  Bayes  theorem  without  assuming  indepen¬ 
dence  of  the  attributes.  A  development  of  a  method  used 
for  Galactic  images  [12]. 

IND [2]. A suite of C software which includes a CART
[1] style decision tree system. Options allow CART-style
cost-complexity pruning by test set or by cross-validation,
and a wide variety of splitting rules such as
Bayesian, information gain and Gini (index of diversity)
methods, and a Wallace-style MML approach to cut
points.
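
Two of the splitting rules named above, information gain and
the Gini index of diversity, can be illustrated with their
textbook definitions. How IND itself implements them is not
described here, so the sketch below is a generic illustration
with hypothetical function names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gini(labels):
    """Gini index of diversity: one minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def split_gain(labels, left, right, impurity):
    """Decrease in impurity when `labels` is split into `left` and `right`."""
    n = len(labels)
    weighted = ((len(left) / n) * impurity(left)
                + (len(right) / n) * impurity(right))
    return impurity(labels) - weighted


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    print(split_gain(labels, ["a", "a"], ["b", "b"], entropy))  # information gain
    print(split_gain(labels, ["a", "a"], ["b", "b"], gini))     # Gini decrease
```

Passing `entropy` as the impurity gives information gain;
passing `gini` gives the decrease in the Gini index for the
same candidate split.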

2.5  Genetic  Algorithms 

A classifier system invented by Holland [9] and implemented
by Riolo [17]. This set of algorithms allows
learning to take place in parallel, rule-based,
message-processing systems. Such a system contains: a classifier
list containing condition-action rules; a message list,
which acts as a “blackboard” or short-term memory; and
input and output interfaces with an environment. Learning
can take place by competition between classifiers,
discovery of new classifiers and a “Bucket-Brigade
Algorithm” [8].

2.6  Machine  Learning:  Traditional  and 
Relational 

AlphaGolem  [13]  is  a  first-order  induction  algorithm 
based  on  relative  least  general  generalisation.  This  gen¬ 
erates  rules  from  given  examples,  which  are  then  used  to 
classify  new  examples. 

C4.5 [14] induces classification rules in the form of decision
trees from a given set of examples which may contain
unknown or noisy entries.

CN2 [4] is an interactive induction algorithm which generates
either rule sets (unordered rules), or rule lists (ordered
rules) from examples, where each example is a
set of attribute-value pairs. It can also determine the
accuracy of a set of rules by applying it to a set of
pre-classified examples.

First Order Inductive Logic (Foil) [15] is a relational
machine learning algorithm which uses entropy as a
heuristic.

Cal5  [21]  constructs  decision  trees  in  real-valued  do¬ 
mains.  This  uses  an  automatic  analog-to-digital  trans¬ 
formation.  The  definition  of  interval  (corresponding  to 
discretisation  of  the  attributes)  depends  on  the  classifi¬ 
cation  problem  at  hand  and  on  the  context,  i.e.  on  the 
place  of  the  test  attribute  within  the  tree,  and  must  also 
be  learned.  Instead  of  using  an  entropy  measure,  interval 
formation  is  governed  by  statistical  criteria. 

AC2  includes  an  object  oriented  knowledge  representa¬ 
tion  language.  It  is  an  extension  of  the  decision  tree 
algorithm  ID3  to  cope  with  relational  data.  It  has  a 
graphical  user  interface  and  outputs  decision  trees  and 
rules.  [10] 

CRS  learns  relational  structures  based  on  graph  theo¬ 
retic  measures  [20]. 

3  Data  Sets 

In this setting, many statistical experiments would use a
variety of simulated data with known properties, whereas
we are using real data, some of which have already been
tried in machine learning problems. One of the criteria
is that the data must be of commercial or industrial
strength, so “toy” or “game” datasets have been deliberately
excluded. Many of the data sets contain missing
data and have other “warts” associated with real problems.
We can group the data sets under a number of
headings according to the application, and we give one
or two examples of each.

3.1  Classification  problems 

These constitute the main part of the effort, partly because
it is clearer how they should be evaluated and
partly because we can be more confident of achieving
our stated aims. Examples include:

Protein folding. The database consists of examples
of protein primary and secondary structure. The aim is
to predict secondary structure from primary structure.
There are about 10,000 examples, each consisting of 221
attributes followed by the decision class, which represents
alpha-helices, beta-strands and coil/turns.

Heart diseases. Several databases concerning heart
disease diagnosis collected from various locations. There
are up to 76 attributes including the angiographic disease
status. These data have been used in previous studies and so
will provide external comparisons.

Hand-written digits. A 16 x 16 array of pixels with
one of 256 grey-levels at each pixel. There are 10 classes
(the digits 0, 1, ..., 9) and 2000 examples for each class.

3.2  Forecasting  or  Prediction  problems 

These  are  typically  short  multivariate  time  series  for 
which  some  Box-Jenkins  methods  have  been  tried,  but 
there  is  interest  in  examining  the  performance  of  ma¬ 
chine  learning  methods.  The  way  in  which  performance 
measures  are  obtained  will  be  similar  to  that  given  be¬ 
low,  but  since  the  outcome  is  real-valued,  the  proportion 
misclassified  will  be  replaced  by  some  other  measure  of 
discrepancy,  such  as  mean  squared  error. 
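
The two measures mentioned here can be written down
directly. The sketch below uses the usual definitions of the
proportion misclassified and the mean squared error; the
function names are hypothetical and the project's own
evaluation code is not described in this section.

```python
def proportion_misclassified(true_classes, predicted_classes):
    """Fraction of examples whose predicted class differs from the true class."""
    pairs = list(zip(true_classes, predicted_classes))
    return sum(t != p for t, p in pairs) / len(pairs)


def mean_squared_error(true_values, predicted_values):
    """Average squared difference between real-valued outcomes and predictions."""
    pairs = list(zip(true_values, predicted_values))
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


if __name__ == "__main__":
    print(proportion_misclassified([0, 1, 1, 0], [0, 1, 0, 0]))
    print(mean_squared_error([1.0, 2.0], [1.5, 1.0]))
```

Both functions take the held-out outcomes and the
corresponding predictions, so the same test procedure can
serve classification and forecasting problems.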

Car registration. Predicting the number of registrations
for the whole car market and the heavy truck market.
There are 56 examples, each consisting of 11 predictive
attributes (for example the industrial production index and
selling prices in the retail trade) and the two values to
be predicted. So far, standard Box-Jenkins methods and
regression analysis have been used, and there is interest
now in trying machine learning and neural net methods.

Currency exchange. The goal is to predict the
US$-Sterling exchange rate three months ahead using current
(and previous) financial indicators, for example retail
sales volume, output per head and unemployment. In all
there are 114 attributes and 141 examples. The decision
“class” here is real-valued.

3.3  Control  problems 

A  dynamic  model  has  been  used  to  describe  the  control 
of a TV satellite. There are high requirements for fuel
consumption, pointing accuracy and positioning - one of
the  difficulties  in  the  control  task  is  the  high  disturbance 
during  orbit  correction  manoeuvres.  The  model  uses  dif¬ 
ferential  equations  to  generate  a  time  dependent  output; 
typically  the  thruster  exhibits  non-linear  characteristics 
with  time  delays  so  overshoots  need  to  be  kept  to  a  min¬ 
imum.  A  further  difficulty  is  caused  by  fuel  sloshing. 
The  control  system  needs  to  be  stable,  fuel  efficient  and 
have  good  response  times.  It  will  be  some  combination 
of  these  factors  that  will  be  used  in  measuring  the  perfor¬ 
mance  of  the  system.  The  question  arises  as  to  whether 
(and  how)  machine  learning  algorithms  can  be  used  in 
this  process. 

3.4  Structure  problems 

These  are  generally  “unsupervised  learning”  in  that  the 
true  class  is  not  given  in  the  training  data.  In  induc¬ 
tive  protein  structure  analysis  the  problem  is  to  describe 
the  protein  super-secondary  structure  by  clustering  the 
examples.  Each  record  consists  of  30  floating  point  num¬ 
bers  which  are  normalised  values  of  attributes  describing 
a  pair  of  secondary  structures  and  the  relationship  be¬ 
tween  them.  Each  record  is  a  potential  example  of  a 
super-secondary  structure.  The  performance  measures 
for these problems will again be different to the supervised
learning case.


4  Performance  measures 

The  allocation  of  algorithm/data  pairs  will  be  done  by 
the  University  of  Strathclyde,  who  are  directing  techni¬ 
cal  aspects  of  the  project.  Each  algorithm  will  be  tested 
by  an  “expert  user”  and  a  “naive  user”.  Objective  mea¬ 
sures  of  performance  will  include  processing  time  (for 
the  training  data  and  the  test  data),  storage  costs  for 
the  processing  of  the  data  and  the  consequent  rule,  and 
an error rate - probably measured by cross-validation
and/or  the  bootstrap.  Subjective  measures  will  include 
ease  of  use,  particularly  as  seen  by  the  “naive  user”,  and 
robustness  to  required  parameter  input. 
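
To make the error-rate measure concrete, here is a minimal Python sketch of a k-fold cross-validation estimate of the misclassification rate. It is only a sketch under stated assumptions: the `fit`/`predict` interface, the number of folds and the nearest-mean classifier used as a stand-in are all hypothetical, and the project's actual procedure may differ.

```python
import random


def cross_validated_error(xs, ys, fit, predict, folds=5, seed=0):
    """Estimate the misclassification rate by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(folds):
        test_idx = idx[f::folds]          # every folds-th case is held out
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors += sum(predict(model, xs[i]) != ys[i] for i in test_idx)
    return errors / len(xs)


# Hypothetical stand-in classifier: assign to the class with the nearest mean.
def fit_nearest_mean(xs, ys):
    means = {}
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(vals) / len(vals)
    return means


def predict_nearest_mean(means, x):
    return min(means, key=lambda label: abs(x - means[label]))
```

Any algorithm under test could be plugged in through the same two callables; the bootstrap alternative mentioned above would resample cases with replacement instead of partitioning them.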

The  procedure  is  that  the  data  format  will  be  revealed 
to  the  holder  of  the  algorithm  so  that  it  can  be  modi¬ 
fied  to  read  the  given  format.  The  algorithm  will  then 
be  deposited,  together  with  clear  instructions  for  usage, 
and  the  real  data  will  be  released.  After  the  algorithm 
has  been  run  the  results  will  be  validated  and  checked 
using  the  deposited  algorithm.  In  the  event  that  the  re¬ 
sults  are  radically  different  a  third  party  will  be  asked 
to  adjudicate. 


232  C.C.  Taylor 


5  Timetable 

There  will  be  some  iterations  within  the  testing  process 
as  algorithms  are  weeded  out  and  refined  in  the  early 
stages.  At  present  the  data  sets  and  algorithms  to  be 
used  are  being  finalised.  From  August  1991  those  algo¬ 
rithms  which  are  performing  badly  will  receive  an  early 
warning  and  may  be  modified  or  excluded  from  the  main 
trials.  The  full  comparative  trials  are  expected  to  com¬ 
mence  in  April,  1992  with  all  results  summarised  and 
analysed  by  January,  1993.  The  final  three  months  will 
consider  ways  in  which  the  best  algorithms  can  be  ex¬ 
ploited  and  the  results  will  be  made  known. 

In  tandem  with  the  experimental  findings  there  will  be 
an  effort  to  explain  the  results  in  a  theoretical  context. 
Amongst other things, this will determine whether the
methods, or merely the implementation of the algorithm,
has led to the results. A survey of previous comparisons
(theoretical or experimental) will also be undertaken.

ABSTRACT

The performance of six discriminant methods is compared
on simulated data consisting of mixtures of continuous,
binary, ordinal and nominal variables. These methods are:
Fisher's linear discrimination, logistic discrimination,
quadratic discrimination, a kernel model, an independence
model and the K-nearest neighbor method. The simulation
design was carefully conceived. The independence model
with an association parameter performs well and is very
robust.

1  INTRODUCTION 

In practice, the application of discriminant analysis
is difficult because the underlying parameters which define
the population are unknown. The choice between different
discriminant analysis methods is hard, especially when the
variables are mixtures of continuous and discrete variables.
Most of the previous studies showed that the performance of
the models is closely related to the underlying distribution.
In recent years, several studies on the performance of
discriminant methods have been published (Titterington et
al. [1981], Knoke [1982] and Schmitz et al. [1985]). The
performance of each discrimination method also depends on
the simulation design, and the usefulness of a
simulation depends highly on the quality of its design.
When the objective is to choose between methods,
modeling any multivariate interaction structure between a
mixture of continuous and discrete variables is difficult. If
the interaction design is too "sophisticated", it is hard to
validate the model. In practical situations (for example, in
supervised pattern recognition), we can only observe the
interaction structure between two variables; in this paper we
propose a simulation design which takes this into account.
The main contribution here is the explicit parametrization
of the simulation design and the consequent study of the
effect of these parameters on each of the different
discriminant methods.

The simulation design is given in section 2. Section 3 deals
with the six methods of discrimination considered in this
paper together with appropriate measures of performance.
The results of the simulation study are described in section
4. Finally, we conclude with a brief discussion.

2  SIMULATION DESIGN

Most of the studies in which discrimination
methods have been compared for mixed data deal with the
simulation of a multinormal distribution for all variables and
with discretization of some of the components. This method
unfortunately has major drawbacks (Habbema et al. [1980]).
In particular, the discretization of the continuous variables
is such that it is hard to make a link between the
multinormal distribution of the continuous data and the
discrete distribution of the data obtained after discretization.
The simulation of a mixture of continuous and discrete
variables is still a difficult problem, due mainly to the lack
of a mixed distribution playing the role of the multivariate
normal distribution for continuous random variables or the
multinomial model in the discrete case. Here we will use the
location model to simulate a mixture of continuous and
discrete variables (Knoke [1982], Schmitz et al. [1985] and
Krzanowski [1986]).

This study is concerned with the discrimination
between two populations, and four types of variables are
considered: $X_1$ binary, $X_2$ nominal, $X_3$ ordinal and
$X_4$ continuous. Two sets of sample sizes are used:
($n_1$=50, $n_2$=50) and ($n_1$=25, $n_2$=25). In each simulation, two
sets of data are generated. The first set is the training set and
is used to construct the discrimination rules. The second is
the test set (or validation set) and is used to evaluate the
performance of the different discrimination rules.

This  simulation  design  deals  with  six  parameters. 
Parameters  A,  B  and  D  describe  the  distance  between  the 
two  groups,  while  parameters  C,  E  and  F  describe  the 
association  structure  between  the  variables. 

2.1  Interaction  structure  between  continuous 
variables  and  discrete  variables 

Knoke [1982] suggested the use of the location
model (Krzanowski [1975]) as a model of interaction
between the continuous and the binary variables. The
distance between the group means of continuous variables
depends on the binary and the nominal variables. The
continuous variables $X_3$ and $X_4$ are simulated as functions of
the discrete variables $X_1$, $X_2$:

$$\mu_1 - \mu_2 = \gamma \qquad (2.1)$$

$$\mu_1 - \mu_2 = (0.45)\,\frac{(-1)^{X_1}(X_2+1)}{p} + \gamma \qquad (2.2)$$

$$\mu_1 - \mu_2 = (0.45)\,(-1)^{X_1+X_2} + \gamma \qquad (2.3)$$

where $\mu_1$ and $\mu_2$ are the mean vectors of variables $X_3$ and $X_4$
for group 1 and group 2 respectively. The parameter F of
our simulation design describes this interaction structure:
F=1 corresponds to equation (2.1); F=2 and F=3 correspond to
(2.2) and (2.3) respectively. The factor $\gamma$ of this model
corresponds to the parameter D of our simulation design.
Three values are given for $\gamma$. The continuous variable $X_3$ is
then discretized using quartile values to obtain an ordinal
variable. Parameter E describes the covariance structure of
the two populations for variables $X_3$ and $X_4$.
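
The quartile discretization step can be sketched in Python as follows. This is an illustrative sketch only: it assumes that the cut points are the sample quartiles of the simulated values and that the ordinal levels are coded 0 to 3; the function name is hypothetical.

```python
import bisect
import statistics


def discretize_by_quartiles(values):
    """Map each value to an ordinal level 0..3 using the sample quartiles."""
    cuts = statistics.quantiles(values, n=4)  # the three quartile cut points
    return [bisect.bisect_right(cuts, v) for v in values]


# Hypothetical continuous sample turned into a four-level ordinal variable.
levels = discretize_by_quartiles([1, 2, 3, 4, 5, 6, 7, 8])
```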


2.2  Interaction  structure  between  nominal  variables 
and  binary  variables 

Kemp and Loukas [1978] considered the problem of
random vector generation using only d-tuples of non-negative
integers, applying the inversion method.
For our purpose, two discrete variables $X_1$ and $X_2$ are
simulated as functions of the groups. The level of
dependence is measured using the level of significance of the
Chi-square test, because the Chi-square statistic itself
depends on the sample size. Parameter A corresponds to the
dependence between $X_1$ and the groups, parameter B to the
dependence between $X_2$ and the groups. Finally, C
corresponds to the dependence between variables $X_1$ and $X_2$.


3  THE DISCRIMINANT ANALYSIS METHODS

Before describing the six discriminant methods
considered in this paper, it is convenient to introduce the
notation and terminology of discriminant analysis.
Individuals in the study are assumed to belong to one of two
populations $\pi_1$ and $\pi_2$. The prior probabilities for
populations 1 and 2 are respectively $P(\pi_1)$ and $P(\pi_2)$.
Information is available on each individual in the form of a
feature vector X of length p. Two sets of data are simulated.
On the first set, a discriminant rule is set up for assigning
an individual to one of the two outcome categories given
the feature vector X appropriate to that individual. In general, a
discriminant rule will be a procedure for obtaining the
posterior probabilities of the form

$$P(\pi_i \mid X) = \frac{P(X \mid \pi_i)\,P(\pi_i)}{P(X \mid \pi_1)\,P(\pi_1) + P(X \mid \pi_2)\,P(\pi_2)}$$

where $P(\pi_j)$ is the prior probability of group j and $P(X \mid \pi_i)$
is the probability of observing the feature vector X for
group i. Different choices for $P(X \mid \pi_i)$ lead to different
discriminant analysis methods.
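
The posterior probabilities above follow directly from Bayes' rule. The short Python sketch below shows the computation for any number of groups; the function name and the numerical values are illustrative assumptions, and the class-conditional densities are supplied by whichever discriminant method is being used.

```python
def posterior_probabilities(likelihoods, priors):
    """Bayes' rule: posterior of each group from densities and prior probabilities."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(joint)
    if total == 0:
        raise ValueError("all joint probabilities are zero")
    return [j / total for j in joint]


# Hypothetical densities for one feature vector under groups 1 and 2, equal priors.
post = posterior_probabilities([0.2, 0.1], [0.5, 0.5])
```

An individual would then be assigned to the group with the larger posterior probability.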


Under the assumptions of multinormality of the
density functions with equal or unequal covariance matrices,
the linear and quadratic discriminant analysis densities are
denoted by $P_{LDA}(X \mid \pi_i)$ and $P_{QDA}(X \mid \pi_i)$ respectively.

If the density function $P(X \mid \pi_i)$ is estimated by the
non-parametric kernel method (KER), we will use the
density function as given by Habbema et al. [1978,a]. The
smoothing parameters are estimated by the maximization of
the modified likelihood function (according to the
leaving-one-out method).

The logistic model (LOG), proposed by Day and
Kerridge [1967], takes a parametric form denoted $P_{LOG}$.
Its parameters are estimated by the maximum likelihood
method.

The independence model (IND) assumes independence
between the variables and deals only with discrete
variables. The density is estimated by

$$P_{IND}(X \mid \pi_i) \propto \Big[\prod_{k=1}^{p} p_i(x_k)\Big]^{\rho},$$

where each $p_i(x_k)$ is estimated from the following counts:
$C_k$ is the number of categories of variable $X_k$,
$n_i(x_k)$ is the number of elements with score $x_k$ on variable
k, $n_i$ is the sample size of group i and $\rho$ denotes an overall
association parameter representing the "proportion of
redundant information" between the variables (see Hilden et
al. [1978]). To use this model here, we discretize the
continuous variable using the quartiles of its distribution.

Fix and Hodges [1951] introduced the K-nearest
neighbor method (KNN). The basic idea is to classify an
individual into the population whose sample contains the
majority of its 'nearest neighbors'. The density is estimated by

$$P_{KNN}(X \mid \pi_i) = \frac{K}{n_i V}$$

where K is the number of samples in the hypersphere r(X)
centered at X, and V is the volume of the hypersphere r(X).
Enas and Choi [1986] investigate the sensitivity of this
method to the choice of K. They suggest choosing K as a
function of the sample size and of the covariance matrix
structure and propose K = 5.
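
The majority-vote idea behind the KNN method can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study: it assumes Euclidean distance on numeric feature vectors, whereas mixed binary, nominal and ordinal variables would need a suitable coding or a different distance. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, x, k=5):
    """Assign x to the group holding the majority of its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], x))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training sample: five points per group.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
labels = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
group = knn_classify(points, labels, (0.2, 0.3), k=5)
```

With two groups, an odd K such as 5 avoids tied votes.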

3.1  Performance measures

Three measures of the ability to discriminate will
be used to evaluate the performance of the six discriminant
analysis methods. The performance of classification rules is
often measured by estimating the error rates (i.e. the
percentage of misclassified cases). Here we used three
measures: the percentage misclassified, the quadratic
score and the logarithmic score (see Habbema et al.
[1978,b]). We will also compare the posterior probabilities
obtained from the validation set for each discriminant
analysis method.
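
The three measures can be sketched in Python as below. The exact definitions of the quadratic and logarithmic scores used in the study are those of Habbema et al. and are not restated in this paper, so the forms shown here (a Brier-type squared distance and the mean negative log posterior of the true group) are assumptions chosen for illustration; scaling and sign conventions may differ. Groups are coded 0 and 1 in this sketch.

```python
import math


def error_rate(posteriors, truth):
    """Proportion of cases whose most probable group is not the true group."""
    wrong = sum(max(range(len(p)), key=p.__getitem__) != t
                for p, t in zip(posteriors, truth))
    return wrong / len(truth)


def quadratic_score(posteriors, truth):
    """Assumed Brier-type form: mean squared distance to the true-group indicator."""
    total = 0.0
    for p, t in zip(posteriors, truth):
        total += sum(((1.0 if j == t else 0.0) - pj) ** 2 for j, pj in enumerate(p))
    return total / len(truth)


def logarithmic_score(posteriors, truth, eps=1e-12):
    """Assumed form: mean negative log posterior probability of the true group."""
    return -sum(math.log(max(p[t], eps)) for p, t in zip(posteriors, truth)) / len(truth)


# Hypothetical posteriors for three validation cases and their true groups.
posteriors = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
truth = [0, 0, 1]
```

Unlike the error rate, the two scores use the posterior probabilities themselves, so they penalize a confident wrong assignment more than a hesitant one.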

4  RESULTS 

The performances with respect to the simulation
design parameters are compared for the two sample sizes.

4.1  Classification between the models

The rank-order score introduced by Schmitz et al.
[1983] is used for rank-order analysis of the scores on the
three performance measures: the error rates, the quadratic
score and the logarithmic score. For each situation and each
performance measure, the method with the best score gets
rank 1, and the worst gets rank 6. Taking the average over
all situations, the results are given in Table 4.1.
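
The rank averaging described above can be sketched in Python as follows. The sketch assumes that a lower score is better and that ties are broken arbitrarily; the method names are taken from the paper, but the score values are hypothetical and only three methods are shown.

```python
def average_ranks(scores_by_situation):
    """Average rank of each method over situations (rank 1 = best, lowest score)."""
    totals = {}
    for scores in scores_by_situation:
        ordered = sorted(scores, key=scores.get)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    n = len(scores_by_situation)
    return {method: total / n for method, total in totals.items()}


# Hypothetical scores for two simulated situations.
situations = [
    {"KER": 0.20, "LDA": 0.22, "KNN": 0.30},
    {"KER": 0.25, "LDA": 0.21, "KNN": 0.28},
]
ranks = average_ranks(situations)
```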

The KER method seems to be the best method. When
the sample sizes are reduced, the LDA and LOG models are
better than the IND model. Obviously, the parametric
models (LDA, LOG, QDA) seem less affected by the sample
size.

4.2  Performances  of  the  methods  with  respect  to 
the  simulation  parameters 

Further  analysis  of  each  analysis  of  variance  table 
revealed  that  the  quadratic  score  is  the  most  representative 
measure  in  terms  of  effect  and  interaction  factors.  Therefore, 
the  quadratic  score  has  been  used  to  illustrate  the  results. 

Table  4.2  shows  the  quadratic  score  for  the 
remaining  five  methods  and  for  parameters  A,B,C  and  D. 
The  performance  improves  in  general  with  increasing 
dependence  between  the  discrete  variables  and  the 
discrimination  groups  (parameters  A  and  B),  except  for  the 
QDA  and  IND  models.  This  improvement  in  performance  is 
not  linear  with  the  level  of  dependence.  We  observe  the 
converse  for  the  QDA  model. 

The  association  parameter  between  the  binary  and 
nominal  variables  C  has  an  unexpected  result  on  the 
performance  of  the  IND  model.  The  performance  of  this 
model  increases  with  the  level  of  dependency  of  these  two 
variables  and  the  performances  of  the  LDA  and  LOG 
models  decrease  with  this  parameter.  The  QDA  model 
performs  worse  than  the  LDA  and  LOG  models  when  C=2 
and  the  converse  is  observed  when  the  dependency  is  high 
(C=3). 

The performance of all models increases with the
distance parameter D. When the distance is large (D=3),
the LOG and the LDA models are superior to the QDA
model. This result follows from the fact that the parameter
of dispersion E has less impact when the distance parameter
D is large.

The QDA model is the most perturbed by the
parameter of dispersion (E) for the ordinal and continuous
variables. We note among other things that the
classification rate changes with the parameters E and F.
When the covariance matrices are equal (E=1), the LDA, LOG
and KER models have the best performance. But for
E=2, the QDA model has a better performance. The IND
model improves significantly over the QDA model when
E=3.

The  global  association  model  parameter  F  yields 
almost  the  same  results  as  parameter  E.  The  IND  model  is 
less  affected  by  this  parameter  than  the  other  models. 


4.3  Reliability  of  the  reported  error  rates 

In this section, we study the reliability of the
reported error rates for the six discriminant analysis
methods. We compute the bias (also called apparent bias)
for each method. This bias corresponds to the difference
between the error rate obtained from the training set and the
error rate obtained from the testing set. The results on this
bias are given in Table 4.3. The smallest bias corresponds
to the most reliable method and the largest bias to the
least reliable. The most reliable error rate is associated with
the LDA model; the least reliable with the KER method.
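
As a small worked example, the Python sketch below computes the apparent bias and picks out the most and least reliable methods from the values of Table 4.3. The sign convention (test error minus training error) is an assumption, since the text only speaks of a difference; the function name is hypothetical.

```python
# Bias values transcribed from Table 4.3.
bias = {"KER": 0.138, "IND": 0.047, "QDA": 0.107,
        "LDA": 0.026, "LOG": 0.027, "KNN": 0.072}


def apparent_bias(train_error, test_error):
    """Assumed sign convention: test-set error rate minus training-set error rate."""
    return test_error - train_error


most_reliable = min(bias, key=bias.get)    # smallest bias
least_reliable = max(bias, key=bias.get)   # largest bias
```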

4.5  Summary  of  the  results 

These  results  may  be  summarized  as  follows: 

1)  Previous studies showed comparable performances of
the LDA and LOG models (Titterington et al. [1981],
Schmitz et al. [1983], Schmitz et al. [1985]). We obtain
similar results for these two models.

2)  The difference in performance between the training
samples of sizes 25 and 50 is very small.

3)  The KER model showed the best ability to
discriminate but, in terms of reliability of reported error
rates, it is the poorest. It is also the most computer
intensive (ten times the computing time of the other
methods).

4)  In agreement with earlier results (Titterington et al.
[1981]), the LDA and LOG models have remarkably
reliable reported error rates.

5)  The  IND  model  with  a  good  association  parameter 
yields  better  results  than  the  LDA  and  QDA  models. 

6)  The  IND  model  seems  less  affected  by  the 
discretization  of  the  two  continuous  variables. 

5  DISCUSSION 

Surprisingly, the performance of the IND model
seems less affected by the association parameters (C,E,F)
than the other methods. The reliability of the reported error
rates for this model is superior to that of the QDA and
KER models, although those models are supposed to be less
sensitive to the association parameters.

Based on this study and on the recent literature on
this topic, we have the following suggestions. When the
discrimination is made for exploratory purposes (Schmitz et
al. [1985]), we propose the use of a 'master' computer
programme containing all methods. The KER and KNN
models are too expensive in computing time, and can be
excluded from this subset when the number of variables or
the sample size is large. In this case, the LDA, LOG and
QDA models can be used simultaneously and some derived
methods (such as LDA augmented) can also be applied to
improve the first results. On the other hand, when the
object of the discrimination is to build an automatic
classification system (as in clinical trials), we propose the
IND model because it is the most flexible model for the
choice of the best subset of variables. The simplicity and
the good performance of this model will make it acceptable
when the set of predictor variables is not fixed and the
sample size is large.

Table 2.1  Parameter E: Covariance matrix for each population

Parameter   Covariance matrix of population $\pi_1$   Covariance matrix of population $\pi_2$

E=1   $\Sigma_1 = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix}$   $\Sigma_2 = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix}$

E=2   $\Sigma_1 = \begin{bmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix}$   $\Sigma_2 = \begin{bmatrix} 1.0 & -0.5 \\ -0.5 & 1.0 \end{bmatrix}$

E=3   $\Sigma_1 = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix}$   $\Sigma_2 = \begin{bmatrix} 3.0 & 0.0 \\ 0.0 & 3.0 \end{bmatrix}$
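
To show how the covariance settings of Table 2.1 could enter a simulation, the Python sketch below draws bivariate normal samples with a given covariance matrix using a hand-written 2 x 2 Cholesky factor. It is a sketch under stated assumptions: the zero mean vector, the sample size, the seed and all function names are hypothetical, and the study's own generator is not described in this paper.

```python
import math
import random

# Covariance matrices (population 1, population 2) transcribed from Table 2.1.
COVARIANCES = {
    1: ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    2: ([[1.0, 0.5], [0.5, 1.0]], [[1.0, -0.5], [-0.5, 1.0]]),
    3: ([[1.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 3.0]]),
}


def sample_bivariate_normal(mean, cov, n, rng):
    """Draw n points from N(mean, cov) using the Cholesky factor of a 2x2 matrix."""
    a = math.sqrt(cov[0][0])
    b = cov[1][0] / a
    c = math.sqrt(cov[1][1] - b * b)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mean[0] + a * z1, mean[1] + b * z1 + c * z2))
    return points


def sample_covariance(points):
    """Unbiased sample covariance between the two coordinates."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)


# Population 2 under E=2: the two continuous variables are negatively correlated.
rng = random.Random(0)
draws = sample_bivariate_normal((0.0, 0.0), COVARIANCES[2][1], 20000, rng)
```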

Table 4.1  Classification score of the methods with respect
to sample size

           KER    IND    QDA    LDA    LOG    KNN
(25,25)    2.34   3.44   3.80   3.09   3.38   4.91
(50,50)    2.01   3.09   3.60   3.45   3.72   5.10

Table 4.2  Quadratic score ($n_1$=50, $n_2$=50)

        KER    IND    QDA    LDA    LOG    KNN
A=1    .226   .227   .217   .226   .249   .197
A=2    .216   .210   .209   .216   .271   .189
A=3    .207   .206   .202   .207   .270   .178
B=1    .223   .221   .212   .223   .249   .190
B=2    .214   .209   .208   .214   .271   .191
B=3    .212   .213   .209   .212   .270   .186
C=1    .213   .212   .212   .213   .249   .190
C=2    .217   .220   .211   .217   .271   .191
C=3    .219   .210   .206   .219   .270   .186
D=1    .228   .220   .219   .228   .249   .197
D=2    .219   .215   .211   .219   .271   .186
D=3    .202   .208   .199   .202   .270   .181
E=1    .217   .237   .219   .217   .249   .202
E=2    .215   .187   .217   .215   .271   .175
E=3    .217   .218   .193   .217   .270   .190
F=1    .218   .231   .212   .218   .249   .210
F=2    .208   .197   .206   .208   .271   .174
F=3    .223   .215   .210   .222   .270   .183

Table 4.3  Reliability of the reported error rates

        KER    IND    QDA    LDA    LOG    KNN
Bias   .138   .047   .107   .026   .027   .072

Abstract

Based  on  CART,  we  introduce  a  recursive  partitioning 
method  for  high  dimensional  space  which  partitions  the 
data  using  low  dimensional  features.  The  low  dimen¬ 
sional  features  are  extracted  via  an  exploratory  projec¬ 
tion  pursuit  (EPP)  method,  localized  to  each  node  in 
the  tree.  In  addition,  we  present  an  exploratory  split¬ 
ting  rule  that  is  potentially  less  biased  to  the  training 
data.  This  leads  to  a  nonparametric  classifier  for  high 
dimensional  space  that  has  local  feature  extractors  opti¬ 
mized  to  different  regions  in  the  input  space. 

1  Introduction 

Due  to  the  curse  of  dimensionality  (Bellman,  1961)  it 
is  desirable  to  extract  features  from  a  high  dimensional 
data  space  before  attempting  a  classification.  This  may 
be  done  in  those  cases  where  the  important  structure 
is  assumed  to  lie  in  a  low  dimensional  subspace  of  the 
original data. The best known method for extracting
features is principal components; however, it has been
argued  that  these  features  may  not  retain  the  structure 
needed  for  classification  (Duda  and  Hart,  1973;  Huber, 
1985).  A  more  general  and  powerful  method  for  feature 
extraction  is  Projection  Pursuit,  and  its  unsupervised 
version  -  Exploratory  Projection  Pursuit  (Friedman  and 
Tukey,  1974;  Friedman,  1987).  This  method  has  been  ex¬ 
tended  in  various  directions,  and  is  reviewed  in  (Huber, 
1985). 

One  of  the  advantages  of  EPP  is  the  use  of  locally 
smooth  objective  functions  in  the  search  for  interesting 
features.  Such  functions  are  not  related  to  the  class 
labels,  and  have  the  potential  of  avoiding  the  curse  of 
dimensionality  (Huber,  1985).  The  method  has  an  un¬ 
derlying  assumption  of  homogeneity  of  the  input  space. 

*This work was supported in part by the National Science
Foundation, the Office of Naval Research, and the Army Research
Office.


Intuitively  this  means  that  a  useful  feature  can  only  be 
found  based  on  all  of  the  input  patterns.  This  poses  a
disadvantage  which  is  due  to  the  fact  that  the  labels  are 
not  used  through  the  search  for  good  projections,  and 
therefore,  it  is  possible  to  ignore  features  that  may  only 
be  important  for  classifying  a  small  portion  of  the  input 
data  but  are  less  interesting  when  considering  the  data 
as  a  whole.  This  observation  is  one  of  the  motivations  of 
recursive  partitioning  methods,  including  tree  structured 
algorithms. 

The  proposed  method  is  based  on  the  classification 
and  regression  tree  algorithm  of  CART  (Breiman  et  al., 
1984).  Section  2  discusses  CART  briefly,  and  indicates 
how the hybrid tree is constructed. A new splitting criterion
based on a variation of a back-propagation network
is  presented  in  section  3.  Finally  a  short  discussion  con¬ 
taining  the  basic  highlights  of  the  method  is  given. 

2  The  Hybrid  CART 

CART  addresses  high  dimensional  space  problems  by 
partitioning  the  space  and  replacing  complex  classifiers 
(or  regressors)  designed  for  the  whole  input  space,  by  a 
set  of  simpler  modules  working  on  smaller  subregions  of 
the  space.  There  have  been  some  recent  attempts  for 
recursive  partitioning  classification  [see  for  example  (Ja¬ 
cobs  et  al.,  1991;  Sankar  and  Mammone,  1991)]. 

CART's main contribution to earlier decision trees is
the treatment of the additional bias introduced by the
over-partitioning of the space. This is done by using a
splitting rule that does not try to reduce misclassification
error and by introducing a bottom-up approach
to pruning the fully grown tree based on cross-validatory
error estimation. The pruning mechanism is a very powerful
tool, and may be useful in remote applications of
CART such as image compression using vector quantization
(Riskin et al., 1990).

CART is not directly applicable to classification problems
in very high dimensional spaces, such as gray level
pixel images, since splitting based on a single dimension
(a single pixel in this case) is unlikely to increase the
homogeneity of subregions in the space. In this work, the
recursive partitioning is based on features extracted using
an EPP method. At each node of the tree, additional
features are sought before the split is constructed, using
only that portion of the input space that arrives at this
node, and these new features are added to the features
extracted so far, to construct an optimal split at that
node. This leads to a combination of feature extraction
and recursive partitioning that has the potential to be
much more powerful than either of the methods by itself.
Moreover, this method is still consistent with the
monotonicity requirement of the cost at each split (Breiman
et al., 1984), and therefore allows the use of the powerful
pruning mechanism of CART.

The  construction  of  the  hybrid  tree  is  the  same  as  in 
the  CART  method  (Breiman  et  al.,  1984)  with  the  ex¬ 
ception  that  every  node  can  perform  additional  feature 
extraction  based  on  the  high  dimensional  input  patterns 
that arrive at that node, and based on the features
extracted so far. The construction of a nested sequence
of trees, the pruning based on cost-complexity, the
cross-validation and the final tree selection can all be done
in exactly the same way as in CART.

The  feature  extraction  part  of  a  node  is  implemented 
by  an  EPP  method  that  seeks  multimodality  in  the  pro¬ 
jected  distributions  (Intrator,  1990).  This  method  is 
based  on  biologically  motivated  synaptic  modification
equations  (Bienenstock  et  al.,  1982),  and  is  computa¬ 
tionally  practical  for  high  dimensional  spaces,  making  it 
suitable  to  be  used  as  the  feature  extractor  in  the  pro¬ 
posed  hybrid  EPP/CART  method. 


3  Pseudo-Supervised  Network 

Although  the  proposed  hybrid  EPP/CART  is  able  to  use 
any  of  the  CART  splitting  rules,  we  would  like  to  con¬ 
sider  a  new  exploratory  splitting  rule  that  allows  linear 
combination  splits.  Linear  combination  splitting  using 
linear  discriminant  functions  was  introduced  in  (Fried¬
man,  1977)  and  was  later  replaced  by  the  algorithm  im¬ 
plemented  in  CART.  The  argument  against  linear  com¬ 
bination  splitting  rules  was  that  they  were  found  to  be 
more  biased.  This  bias  comes  from  the  fact  that  the 
split  is  constructed  in  order  to  minimize  some  measure 
of  nonhomogeneity  based  on  the  class  labels,  but  with 
no  concern  to  the  structure  of  the  space  induced  by  the 
input patterns. A simple example of a possible bias is
shown in figure 1. In this figure, a two dimensional structure
(possibly part of a much higher dimensional space)
can be split in various ways, all of which are similar in
the sense that they yield two pure subnodes. However, if
the data contains patterns that lie outside the two ovals
it is likely that only split 2 is optimal. In this section we
present a splitting rule that is based solely on the input
patterns. This rule can be incorporated into the original
CART method, and potentially into other recursive
partitioning methods.

Figure 1: Optimal split (2) and nonoptimal ones (1,3). A
method that tries to maximize homogeneity based only
on the class labels will not distinguish between these
splits; however, it is likely that split (2) will have better
generalization properties (will be less biased to the
training data).

Consider a split that assigns the value 1 to all the
members of the training set at node $t$ that belong to
$t_R$, and the value 0 to the members of $t_L$, so that both
sets are nonempty. Let $F = \{f_\alpha\}$ be a set of continuous
functions that depend on a parameter $\alpha$, where $f_\alpha$ maps the
input space to $[0, 1]$. Let $\chi_t$ be a characteristic function
assigning the value 1 to $x \in t$, and 0 otherwise. For a given
split $s$, assume that $f_{\alpha_s}$ is the best approximator (not
necessarily unique) to the characteristic function $\chi_{t_R}$ in
the MSE sense. Now seek the optimal split $s^*$ so that
$E[(f_{\alpha_{s^*}} - \chi_{t_R})^2]$ is minimized.

Finding an optimal split in this way ensures that,
within a given set of continuous functions, the split
results in a function that assigns the data in $t_L$
values closest to zero (in the MSE sense), and the data
in $t_R$ values closest to one. This ensures that
the patterns that belong to $t_R$ are in some measure close
to each other, and far apart from the patterns in $t_L$, i.e.,
increased homogeneity of the input space.

An  example  where  this  splitting  rule  along  with  fea¬ 
ture  extraction  may  be  useful  is  given  in  figure  2.  It 
shows  a  subregion  in  space  in  which  two  classes  are 
strongly  mixed.  A  supervised  splitting  algorithm  will 
split according to hyperplane 1 whereas the above unsupervised
splitting rule will prefer to split according to
hyperplane  2.  This  is  because  split  1  increases  the  purity 
of  each  node  more  than  split  2  although  split  1  does  not 
focus  on  the  confusion  region  between  class  A  and  B.  It 
is  conceivable  that  if  the  confused  region  is  transferred 
in  full  to  a  node,  and  then  an  attempt  to  extract  more 
informative  features  only  from  this  region  is  made,  the 
new  representation  will  have  a  better  chance  to  reduce 
the  confusion  between  the  classes  in  this  subregion. 


Figure  2:  The  ability  of  an  unsupervised  splitting  rule 
to  reduce  confusion. 


3.1  Splitting  Rule  Implementation 

In order to use a gradient descent method for finding
the optimal split, we need to overcome the discontinuity
introduced by the function $\chi_{t_R}$. Therefore, a continuous
approximation to $\chi_{t_R}$ is used. We shall follow the notation
presented in (Rumelhart et al., 1986), and present
a splitting rule that is based on a variation of the error
back-propagation network.

Let $o_{pj}$ be the output of the $j$'th splitting rule function
for input pattern $p$. $f_j$ is a sigmoidal activation
function defined by $f_j(t) = [1 + \exp(-t)]^{-1}$, so that
$o_{pj} = f_j(net_{pj})$, where $net_{pj} = \sum_i w_{ji} o_{pi}$. Let the target
for output $j$ also be defined in terms of the network
activity, $t_{pj} = \tilde f_j(net_{pj})$, where $\tilde f_j$ is a sigmoidal function
with a gain constant $\lambda > 1$, $\tilde f_j(t) = [1 + \exp(-\lambda t)]^{-1}$.
The network is trained to minimize the empirical MSE
$\sum_p (t_p - o_p)^2$. In order to avoid trivial splits it is possible
to add a penalty term

[penalty expression not recoverable from the source; only the summation indices $p$ survive]

for some small constant $\epsilon$; however, simulations show
that  the  trivial  split  does  not  usually  happen  especially 


when  there  are  several  neurons  in  the  hidden  and  output 
layer. 

Figure 3: The minimization of the pseudo-supervised
MSE is equivalent to minimizing the shaded area in the
picture.

The difference between $t_{pj}$ and $o_{pj}$ is shown in figure 3.
This target function approximates a characteristic function,
an approximation which improves as the gain constant $\lambda \to \infty$.
In practice, there is no need for the gain to be greater than
5. The calculation of the gradient with respect to the
weight $w_{ji}$ follows in the same way as in (Rumelhart
et al., 1986), when taking into account the fact that the
target depends on the network output as well. For an
output layer unit $j$ we have

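$$\frac{\partial E_p}{\partial w_{ji}} = \left[ \frac{\partial E_p}{\partial o_{pj}} \frac{\partial o_{pj}}{\partial net_{pj}} + \frac{\partial E_p}{\partial t_{pj}} \frac{\partial t_{pj}}{\partial net_{pj}} \right] \frac{\partial net_{pj}}{\partial w_{ji}},$$

and it follows that for

$$\delta_{pj} = (t_{pj} - o_{pj}) \left[ o_{pj}(1 - o_{pj}) - \lambda\, t_{pj}(1 - t_{pj}) \right],$$

we get

$$\frac{\partial E_p}{\partial w_{ji}} = -\delta_{pj}\, o_{pi}.$$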

The  calculation  of  the  gradient  with  respect  to  a  hidden 
unit  weight  is  exactly  as  in  (Rumelhart  et  al.,  1986),  and 
will  not  be  repeated  here. 
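
The output-unit gradient can be checked numerically. The following is a minimal sketch, not the author's implementation: it assumes a single output unit with no hidden layer, takes the per-pattern error as one half of the squared difference between the soft target and the output (the usual back-propagation convention, so that no factor of 2 appears), and uses hypothetical function names and a hypothetical gain of 3. The gain is written `lam` here.

```python
import math
import random

def sigmoid(t, gain=1.0):
    """Logistic function with an optional gain constant."""
    return 1.0 / (1.0 + math.exp(-gain * t))

def ps_error(w, patterns, lam):
    """Pseudo-supervised error: 0.5 * sum over patterns of (target - output)^2."""
    total = 0.0
    for x in patterns:
        net = sum(wi * xi for wi, xi in zip(w, x))
        total += 0.5 * (sigmoid(net, lam) - sigmoid(net)) ** 2
    return total

def ps_gradient(w, patterns, lam):
    """Analytic gradient dE/dw_i = -sum_p delta_p * x_pi."""
    grad = [0.0] * len(w)
    for x in patterns:
        net = sum(wi * xi for wi, xi in zip(w, x))
        o = sigmoid(net)        # network output
        t = sigmoid(net, lam)   # soft target, sharper sigmoid of the same net input
        delta = (t - o) * (o * (1.0 - o) - lam * t * (1.0 - t))
        for i, xi in enumerate(x):
            grad[i] -= delta * xi
    return grad

def numeric_gradient(w, patterns, lam, h=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        dn = list(w)
        up[i] += h
        dn[i] -= h
        grad.append((ps_error(up, patterns, lam) - ps_error(dn, patterns, lam)) / (2 * h))
    return grad

random.seed(0)
patterns = [[random.gauss(0, 1) for _ in range(3)] for _ in range(20)]
w = [0.5, -0.3, 0.8]
analytic = ps_gradient(w, patterns, lam=3.0)
numeric = numeric_gradient(w, patterns, lam=3.0)
```

The two gradients should agree to within finite-difference error, which confirms that the extra term involving the gain accounts for the dependence of the target on the weights.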

An  intuitive  explanation  to  this  target  definition  is 
similar  to  the  reasoning  behind  hard  and  soft  competi¬ 
tion  approaches  (Hinton  and  Nowlan,  1990).  If  a  hard 
target  (0  or  1)  is  imposed,  then  whenever  the  output  is 
close  to  .5  which  means  that  the  input  is  close  to  the 
boundary,  the  error  signal  would  be  large.  However  if 
the  input  is  close  to  the  boundary,  it  is  likely  to  be  on 
the  wrong  side  of  the  boundary,  which  will  then  lead  to 
a large wrong correction signal. Using the soft target
which  takes  into  account  the  confidence  in  the  output 
solves  this  problem,  since  the  target  is  also  close  to  0.5. 
Another  explanation  is  obtained  by  observing  that  the 
target  is  also  dependent  on  the  synaptic  weights,  and 
therefore  the  gradient  of  the  synaptic  weights  with  re¬ 
spect  to  the  output  should  be  taken  into  account  as  well. 
This  requires  the  use  of  a  soft  target. 

The construction of a binary splitting rule based on
the above criterion is done by letting the PS network
converge (or stopping training based on another criterion)
and then assigning the patterns for which the output of the
network is greater than 0.5 to $t_R$. In the case of a multi-split,
assign to set $j$ the patterns for which the output
of unit $j$ in the network is greater than 0.5.

4  Discussion 

A  method  of  recursive  partitioning  for  high  dimensional 
input  spaces  was  introduced.  This  was  done  by  combin¬ 
ing  the  benefits  from  exploratory  projection  pursuit  with 
those  from  the  CART  method.  A  new  exploratory  split¬ 
ting  rule  was  presented,  and  argued  to  have  the  potential 
to be less biased to the training data. This splitting rule
can have a boundary that contains an arbitrary predefined
number  of  hyperplanes  by  defining  the  number  of  hidden 
units  in  the  feedforward  network,  and  is  easily  extended 
into multiple splits. The implementation of the splitting
rule using a new unsupervised variation of back-propagation
training is potentially useful for other purposes
where  a  soft  competition  rule  is  better  than  a  hard  one, 
e.g.  in  adaptive  equalization  (Lucky,  1966;  Hinton  and 
Nowlan,  1990). 

Combining all the above ingredients results in a
computationally practical method for nonparametric
classification in very high dimensional spaces,
that  is  less  sensitive  to  the  curse  of  dimensionality  due  to 
the  feature  extraction,  and  is  less  biased  to  the  training 
data,  due  to  the  sophisticated  tree  construction  of  the 
CART  method. 

Abstract

This  paper  presents  a  class  of  non-parametric  density 
estimators  on  a  low  dimensional  space.  The  support 
of  these  estimators  is  defined  by  the  convex  hull  of 
the  set  of  observations.  A  random  sample  from  the 
set  of  observations  is  used  to  tessellate  the  interior  of 
the  convex  hull.  The  attribution  of  empirical 
probability  mass  to  the  tiles  resulting  from  the 
tessellation  produces  a  density  estimate.  With  a  set 
of  appropriate  linear  constraints  on  the  attribution 
of  mass,  the  estimator  is  shown  to  be  a  conditional 
maximum  likelihood  estimator.  Repeating  this 
procedure,  and  averaging  these  density  estimates 
within  tiles,  produces  a  bootstrap  estimate  of  the 
density  function.  The  results  of  this  resampling  and 
density  estimation  process  are  presented  in  graphic 
form. 

1.  Introduction 

The objective of this paper is to construct a
class of non-parametric probability density
estimators, $\hat f(X)$ of $f(X)$, that make few, and
comparatively weak, assumptions about the support
and characteristics of $f(X)$ beyond that provided by
a set of observations, $Y$. Let $Y = \{Y_1, \ldots, Y_n\}$ be a
set of observations, with $Y_i \in S^d$, $i = 1, \ldots, n$, and
$S^d$ a $d$-dimensional real product space,
with $\mu$ the usual $d$-dimensional
Lebesgue measure. A non-empty class of estimators,
$\hat f(X)$, exists that has the maximum likelihood
property, and is strongly consistent, given a set of
observations $Y \subset S^d$, $1 \le d < \infty$, and a minimum
number, A, of observations per tessellating tile.

2.  Support 

The support for $f(X)$ is defined as
$A = \{X \in S^d : 0 < f(X) < \infty\}$. We will define the
support $A^n$ for $\hat f(X)$, given a set of observations $Y$
of size $n$, as the smallest closed, convex region in $S^d$
that contains $Y$. Note that if another observation is
added to $Y$ then $A^n \subseteq A^{n+1}$. Another way to
describe $A^n$ is to say that $A^n$ is the set defined by
the convex combination of the elements of $Y$. Let
$H$ be the set of $Y_i \in Y$ that are on the convex hull
of $A^n$. Then the definition of $A^n$ can be formulated
as $A^n = \{X \in S^d : X = \alpha H + (1 - \alpha)H\}$, for all $\alpha$,
$0 \le \alpha \le 1$.

$A^n \subset S^2$ can be seen in Figure 1 as the region
defined by line segments connecting points in the
point cloud of observations such that $A^n$ is convex,
and $Y$ is contained in $A^n$. Also, $H$ is the set of
$Y_i \in Y$ that are vertices for the line segments that
define $\partial A^n$, the convex hull of $A^n$.

Figure 1


To  derive  an  estimate  of  the  density  of 
observations  on  A"  it  is  necessary  to  examine  sub- 
regions  of  A"  and  compare  the  weight  of 
observations  on  that  region  to  the  total  weight  of 
observations.  Two  general  methods  are  currently 
well  known,  the  kernel  density  estimation  method 
[Rosenblatt,  1956],  [Parzen,  1962]  and  the  binning 
method  [Scott,  1985],  [Carr,  1987]. 

3.  Density  Estimation  Methods 

With  the  kernel  method,  a  smoother  with  finite 
support,  such  as  an  Epanechnikov  kernel,  or 
smoother  with  infinite  support,  such  as  a  Gaussian 
kernel,  is  convolved  with  the  empirical  distribution 
and  a  weighted  sum  of  the  contributions  to  the 
density  at  the  center  of  the  kernel  is  computed  from 
the  observations.  This  approach  assumes  support  is 
continuous  beyond  the  region  defined  by  the  set  of 
observations.  But  more  important,  the  theoretical 
computational  complexity  increases  with  the 
dimension. 

The  binning  method  tessellates  the  support 
into  fixed  size  tiles  and  computes  the  density  on  a 


tile  as  the  ratio  of  the  weight  of  observations  on  each 
tile  to  the  total  weight  of  observations  times  the  area 
of  the  tile.  This  method  is  computationally  quite 
tractable.  The  problem  with  this  method  is  that  to 
get  reasonably  smooth,  non-trivial,  estimates  where 
the  data  are  sparse,  the  tiles  must  be  relatively  large. 
But,  by  making  the  tiles  large,  the  fine  structured 
features  of  the  density  are  obscured  where  the  data 
are  closely  packed. 
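
The trade-off described above can be seen in a small one-dimensional sketch. This is an illustration only: the function name, the data values, and the bin counts are hypothetical, and the data are assumed to lie inside the interval `[lo, hi]`.

```python
def binned_density(data, lo, hi, n_bins):
    """Fixed-width binning: density on a bin = count / (total count * bin width)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # A point on the right edge is placed in the last bin.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    total = len(data)
    return [c / (total * width) for c in counts], width

# Hypothetical sample: a tight cluster near 0.2 and a few sparse points.
data = [0.1, 0.15, 0.2, 0.22, 0.25, 0.3, 0.9, 2.5, 3.8]
dens_fine, w_fine = binned_density(data, 0.0, 4.0, 16)
dens_coarse, w_coarse = binned_density(data, 0.0, 4.0, 2)
```

With small bins the sparse region produces many empty bins (a trivial estimate of zero), while with large bins the whole cluster is merged into a single flat bin, which is the behavior that motivates tiles of varying size.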

A  third  method  is  proposed.  This  method  is  to 
tessellate  the  support  A"  into  tiles  of  varying  sizes, 
based  on  the  location  of  the  observations.  This 
might  be  called  data  directed  tessellation.  In  this 
way  the  tiles  will  be  large  where  data  are  sparse,  and 
small  where  the  data  are  closely  packed. 
Furthermore,  no  assumptions  are  being  made  about 
support  beyond  the  convex  hull  of  the  point  cloud 
defined  by  the  observations. 

4.  Tessellation  of  Support 

Of the many possible ways that the data might
direct the tessellation of $A^n \subset S^d$ there is one that
is unique for any inter-point distance measure $l_p$,
$1 \le p \le \infty$, up to pathological cases [Preparata,
1988]. This is the high dimensional analog of the
Delaunay tessellation, where $d + 1$ points define a
tessellating polytope, and any point in the interior of
a defined polytope is closer to these $d + 1$ points
than to any other $d + 1$ points in the tessellating
point set. The Delaunay tessellation of the support
$A^n$ yields a set of convex polytopes $\{A_i^n\}$ of
cardinality $m$, where $\emptyset = A_i^n \cap A_j^n$, $i \ne j$, and
$A^n = \bigcup_i A_i^n$, that have measure $\mu(A_i^n) > 0$, $1 \le i, j \le m$,
and have a geometric nearest neighbor property for
points in the interior of each polytope. It is the
geometric properties of the Delaunay tessellation
that make it a more computationally tractable
procedure for tessellating a $d$-dimensional product
space.

Figure 2 is the Delaunay tessellation of 25
observations in $S^2$. Note that the $d + 1$ points that
define  each  triangle,  or  tessellating  polytope,  also 
define  a  circumscribed  circle,  and  this  circle  contains 
no  other  points  in  the  tessellating  set  of 
observations. 
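
The empty circumscribed circle property can be tested directly for a candidate triangle in the plane. The sketch below is an illustration under stated assumptions: the function names and the example points are hypothetical, and it only checks one triangle rather than constructing a tessellation.

```python
def circumcircle(a, b, c):
    """Center and squared radius of the circle through three 2-D points."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0.0:
        raise ValueError("points are collinear")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    r2 = (ax - ux) ** 2 + (ay - uy) ** 2
    return (ux, uy), r2

def has_empty_circumcircle(tri, points, eps=1e-12):
    """True if no point other than the triangle's vertices lies strictly inside."""
    (ux, uy), r2 = circumcircle(*tri)
    others = [p for p in points if p not in tri]
    return all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - eps for px, py in others)

tri = ((0.0, 0.0), (2.0, 0.0), (0.0, 2.0))
far_points = [tri[0], tri[1], tri[2], (3.0, 3.0)]
near_points = [tri[0], tri[1], tri[2], (1.5, 1.5)]
ok_far = has_empty_circumcircle(tri, far_points)
ok_near = has_empty_circumcircle(tri, near_points)
```

A triangle that fails this check cannot belong to the Delaunay tessellation of the point set, which is why the point at (1.5, 1.5) rules out the example triangle while the point at (3, 3) does not.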


5.  Probability  Mass 

The empirical probability mass on elements of
the tessellating set $\{A_i^n\}$ needs to be examined. Let
$\partial A_i^n$ be the convex hull of $A_i^n$. Then
$P[X \in \partial A_i^n] = 0$ for a random $X \in S^d$. But since $A_i^n$
is defined by elements of $Y$, there are $d + 1$ elements
of $Y$ in $\partial A_i^n$. Those observations that are in the
interior of $A_i^n$ attribute all of their weight to $A_i^n$,
$0 < i \le m$. A question arises when the attribution of
weight for points in $\partial A_i^n$ is considered. Let the interior
weight be the weight attributable to $A_i^n$ from observations in
the interior of $A_i^n$, and let $w^*$ be the weight
attributable from observations in $\partial A_i^n$. Then the
total weight on a tile $A_i^n$, $W(A_i^n)$, is the sum of the
interior weight and $w^*$.

The assignment of weight to $w^*$ by the $d + 1$
observations in $\partial A_i^n$ can be computed by solving a
linear system of equations to maximize the
likelihood product.


6.  Density  Estimator 

A class of density estimators can be defined on
$A^n$ by

$$\hat f(X) = \frac{W(A_i^n)}{W(A^n)\, \mu(A_i^n)} \quad \text{for } X \in A_i^n,$$

where $W(A^n)$ is the total weight of observations on
the support $A^n$, and $\mu(A_i^n)$ is the integral measure
of $A_i^n$. This class of estimators can be shown to
have  the  maximum  likelihood  property  [Robertson, 
1967].  If  the  additional  constraint  is  added  to  the 
tessellation  procedure  that  at  least  A  observations 
are  contained  in  each  polytope,  where  A  is  a  function 
of  the  sample  size,  then  this  class  of  estimators  can 
be  shown  to  be  strongly  consistent,  given  a  set  of 
observations  Y  [Wegman,  1975]. 
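
As a small worked illustration of the tile-wise estimator (tile weight divided by total weight times tile measure), the sketch below uses three hypothetical triangular tiles in the plane with hypothetical weights. The tiles are written down by hand rather than produced by a tessellation routine, and the weights are taken as given rather than obtained from the likelihood-maximizing linear system.

```python
def triangle_area(a, b, c):
    """Area of a 2-D triangle from its vertices (half the absolute cross product)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def tile_densities(tiles, weights):
    """Density on each tile: tile weight / (total weight * tile area)."""
    total = sum(weights)
    return [w / (total * triangle_area(*tile)) for tile, w in zip(tiles, weights)]

# Hypothetical tiles and weights; they do not come from the paper's data.
tiles = [
    ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0)),
    ((1.0, 0.0), (1.0, 1.0), (0.0, 1.0)),
    ((1.0, 0.0), (3.0, 0.0), (1.0, 1.0)),
]
weights = [6.0, 2.0, 2.0]

dens = tile_densities(tiles, weights)
# Integrating the piecewise-constant estimate over the support recovers total mass 1.
mass = sum(d * triangle_area(*t) for d, t in zip(dens, tiles))
```

The small, heavily weighted tile receives the highest density, and the estimate integrates to one over the union of the tiles by construction.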


7.  Examples 

The first two examples, Figure 3 and Figure 4,
show the density estimate for a data set with 201
observations, and A = 6.

Figure 3 shows the result of the first iteration.
Each tile contains at least 6 observations. The
estimate is rough but it shows that the data are at
least bimodal on $S^d$.

Figure  4  is  the  average  density  on  all  subtiles 
of  the  same  data  set  after  15  iterations  of  the 
resampling  procedure.  The  solid  bars  on  the  ends 
are  the  result  of  always  having  to  have  at  least  6 
observations  per  tile.  The  plot  is  still  rough,  but  it 
accurately  reflects  the  density  of  the  observations, 
without  making  continuity  assumptions. 


Again  the  density  plot  looks  ragged,  particularly 
over  relatively  small  regions  at  both  ends,  and  with 
a  relatively  smooth  region  between.  This  is  caused 
by  relatively  sparse  data  between  regions  of 
relatively  dense  data.  Also  the  density  of  the  data 
would  appear  to  be  bimodal. 

Figure  6  is  the  distribution  computed  by 
integrating  the  density  from  Figure  5  over  the 
support.  By  applying  integration  as  a  natural 
smoother,  it  is  reasonably  clear  that  the  data  is  a 
random  sample  from  a  step  wise  continuous  uniform 


Figure  5 


Figure  7 

Figure 8 is the result of resampling for these
same 25 observations, with A = 1. After only three
iterations the support has been partitioned into
small subtiles where the data are dense, and
relatively large subtiles where the data are sparse.
The resulting average density estimate on subtiles is
thus more refined where the data are dense, and less
refined where the data are sparse. Note also that
the marginal densities will be piecewise continuous
on the support $A^n$.


8.  Conclusions 

The adaptive density estimation procedure has
much in common with the bootstrap. As such, it
inherits many of the theoretical statistical properties
of  the  bootstrap.  It  has  additional  properties  derived 
from  geometry  and  the  tessellation  procedure 
employed.  In  combination,  these  properties  offer  a 
rich  area  for  theoretical  statistical  study,  for 
exploratory  data  analysis,  and  for  extending  our 
understanding  of  computational  geometry.  From  a 
computing  perspective,  this  procedure  is  somewhere 
between  binning  methods  and  kernel  density 
estimation  methods  for  both  compute  time  used  and 


storage.  The  computational  advantages  of  adaptive 
density  estimation  methods  over  kernel  density 
estimation methods become quite dramatic as the
dimension  of  the  support  increases. 

Abstract

Procedures  for  estimating  the  probability  density 
curve  of  the  distribution  from  which  a  single  sample 
of  size  n  has  been  taken  will  often  produce  curves 
which  are  quite  erratic  and  require  much  smoothing. 
We  consider  a  simple  method  of  density  estimation 
which  will  produce  smooth  curve  estimates  without 
applying  any  smoother.  We  apply  the  method  to  both 
symmetric and skewed distributions.

1.  Introduction 

In  this  paper  we  consider  the  estimation  of  the  pro¬ 
bability  density  curve,  f(x),  of  a  continuous  distribution 
from  which  a  single  sample  of  size  n  has  been  taken. 
There  are  numerous  parametric  procedures  of  density 
estimation  in  which  the  sample  is  assumed  to  arise 
from  some  family  of  distributions  from  which  it  is  then 
necessary  to  select  the  member  of  this  family  which 
best  describes  the  given  data  set.  For  example,  the 
method  of  maximum  likelihood  selects  that  member  of 
the  distributional  family  for  which  the  probability  of 
having  obtained  the  given  sample  is  maximized.  In 
Bayes  estimation  a  prior  distributional  model  is  combi¬ 
ned  with  the  information  provided  by  the  sample  to 
produce  a  posterior  model. 

Non-parametric  methods  of  density  estimation  typi¬ 
cally  make  use  of  the  spacing  or  clustering  of  the 
points in the data set. For large data sets, the
frequency histogram and its corresponding frequency curve
provide  a  rough  estimate  of  the  shape  of  the  distri¬ 
bution.  However  these  methods  tend  to  be  somewhat 
arbitrary  as  to  the  choice  of  class  interval  and  method 
of  smoothing.  They  also  are  of  little  use  for  small 
samples.  Chambers,  Cleveland,  Kleiner  &  Tukey  (1983) 
suggest a generalization in which the class interval
(window)  is  allowed  to  move  along  the  entire  range  of 
the  data.  The  fraction  of  the  entire  data  set  in  the 
window  is  a  measure  of  the  density  at  the  center  of  the 
window.  They  call  this  the  density  trace.  Since  this 
method  can  be  quite  erratic  as  points  enter  or  leave  the 
window  as  the  window  moves  along  the  real  line,  a 
smoother result is obtained by averaging the number of
data  points  using  a  weight  function  (the  kernel)  which 
is  a  maximum  near  the  center  of  the  window  and 


decreases  to  zero  at  the  edges  of  the  window.  This 
leads  to  a  method  now  called  kernel  density  estimation. 
Kernel  estimation  does  not  seem  to  depend  too  much 
on  the  type  of  kernel  used,  but  does  vary  greatly  with 
the  length  of  the  window,  or  “band-width”.  The 
method also cannot properly estimate the density curve
beyond  the  data  set  as  the  density  drops  suddenly  to 
zero. See also Tapia & Thompson (1980) for more
details  on  these  different  methods. 

In  our  approach  we  will  begin  with  a  suitably 
smooth density curve which is then stretched or compressed
to fit the spacing of the data in the sample.

2.  Basic  method 

Consider the ordered sample $x_0 = -\infty < x_1 < x_2 < \cdots < x_n < x_{n+1} = \infty$,
taken from a distribution $F(x)$
with density function $f(x)$, which is what we wish to
estimate. Let us define the centering points,

$$t_i = \frac{x_i + x_{i+1}}{2}, \quad i = 1, 2, 3, \ldots, n-1. \qquad (1)$$

These $t_i$ partition $\mathbb{R}$ into $n$ sub-intervals, each containing
exactly one observation. Suppose we take a continuous
distribution, $G(x)$, which we shall call the trial
distribution. For now, we will assume that $G$ is
location-scale invariant. If we use $G$ to divide $\mathbb{R}$ into
$n$ intervals of equal probability


Then, let us estimate $f(x)$ by requiring that the cumulative
distribution satisfy (2) and that the curve between
these points be continuous and smooth. We can then
estimate $f(x)$, up to a multiplicative constant, by stretching
or compressing $g(x)$, the density curve corresponding
to $G(x)$,

$$\hat f(x) = c\, g\!\left(\frac{x - \mu_i}{\sigma_i}\right), \quad t_i \le x \le t_{i+1}, \qquad (3)$$

where $\mu_i$ and $\sigma_i$ are chosen to satisfy (2),

which gives

$$\mu_i = \frac{\lambda_{i+1} t_i - \lambda_i t_{i+1}}{\lambda_{i+1} - \lambda_i}, \qquad \sigma_i = \frac{t_{i+1} - t_i}{\lambda_{i+1} - \lambda_i}. \qquad (5)$$


The constant $c$ is required since we are changing
the scale as we stretch or compress the distribution to
fit the quantiles, and hence we must determine $c$ by
(numerically) integrating the resulting density curve
given by (3) and (5). In order to apply this procedure
we will need to compute the quantiles of the trial
distribution,

$$\lambda_i = G^{-1}(i/n), \quad i = 1, 2, 3, \ldots, n-1. \qquad (6)$$
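
The piecewise construction can be sketched in a few lines. This is a hedged illustration, not the author's program: it assumes a standard normal trial distribution, trial quantiles at probabilities i/n, midpoints of adjacent order statistics as centering points, a location and scale per interval chosen so that each centering point maps onto its trial quantile, and end intervals that borrow the parameters of their neighbors. The function names are hypothetical, the constant `c` is left at 1 (unnormalized), and the sample is the Darwin data quoted later in the paper.

```python
import bisect
from statistics import NormalDist

def fit_intervals(sample, trial=NormalDist()):
    """Per-interval (mu, sigma) that map centering points onto trial quantiles."""
    x = sorted(sample)
    n = len(x)
    # Centering points t_1..t_{n-1}: midpoints of adjacent order statistics.
    t = [(x[i] + x[i + 1]) / 2.0 for i in range(n - 1)]
    # Trial quantiles lambda_i = G^{-1}(i/n), i = 1..n-1 (assumed form).
    lam = [trial.inv_cdf(i / n) for i in range(1, n)]
    interior = []
    for i in range(n - 2):
        sigma = (t[i + 1] - t[i]) / (lam[i + 1] - lam[i])
        mu = t[i] - sigma * lam[i]
        interior.append((mu, sigma))
    # The two open-ended intervals reuse the parameters of their neighbors.
    params = [interior[0]] + interior + [interior[-1]]
    return t, lam, params

def density(xval, t, params, trial=NormalDist(), c=1.0):
    """Unnormalized piecewise estimate c * g((x - mu_i) / sigma_i)."""
    i = bisect.bisect_right(t, xval)  # index of the interval containing xval
    mu, sigma = params[i]
    return c * trial.pdf((xval - mu) / sigma)

darwin = [-67, -48, 6, 8, 14, 16, 23, 24, 28, 39, 41, 49, 56, 60, 75]
t, lam, params = fit_intervals(darwin)
```

Because both neighboring intervals send a shared centering point to the same trial quantile, the curve is continuous at every boundary, although its slope generally is not. The sketch requires at least three observations and distinct centering points.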

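Actually, since the $t_i$'s are taken halfway between
the sample values, it would be better to adjust the
quantiles for this shift and use:

$$\lambda_i = G^{-1}(\,\cdots\,), \quad i = 1, 2, 3, \ldots, n-1. \qquad (7)$$

[The argument of $G^{-1}$ in (7) is illegible in the source.]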

Indeed when we used a sample of 24
“observations” consisting of the quantiles of a standard
normal and applied the procedure using (3), (5) and (6)
and a normal trial distribution, we obtained a density
estimate which was very “normal” in shape, but with
mean near zero and standard deviation of only 0.85
(both quantities numerically integrated from the density
curve estimate). On the other hand, when we used (7)
in place of (6), we obtained the estimate shown in
figure 1, which has a computed mean of almost exactly
0 and standard deviation 1.01. [Using only a sample of
4 quantiles produced even worse results using (6), but
again produced the standard normal when using (7).]

Figure 1: Density estimate of normal quantiles “data”


When  we  used  a  logistic  trial  distribution  with  the 
sample  of  24  standard  normal  quantiles,  and  equations 
(3),  (5)  and  (7)  ,  we  again  obtained  a  very  normal¬ 
shaped  curve  but  with  a  slightly  smaller  standard  devia¬ 
tion  of  only  0.95.  This  is  not  surprising  as  the 
logistic  distribution  tends  to  have  a  similar  shape  to  the 
normal but is slightly narrower and has longer tails.

For an open-ended distribution with positive probability
over the entire real line, $\lambda_0 = -\infty$ and $\lambda_n = \infty$,
and so there is no solution possible for the first or last
interval. The simple solution is to take $\mu_i$ and $\sigma_i$ from
the neighboring intervals:

$$\mu_0 = \mu_1, \; \sigma_0 = \sigma_1; \qquad \mu_{n-1} = \mu_{n-2}, \; \sigma_{n-1} = \sigma_{n-2}. \qquad (8)$$

A  sample  of  five  observations  is  generated  from  a 
normal  population  and  the  density  estimate  obtained 
using  a  normal  trial  distribution  (solid  line)  and  also 
using  a  logistic  trial  distribution  is  shown  in  figure  2. 
Both  plots  are  very  similar  and  both  plots  depend  very 
greatly  on  the  trial  distribution  since  the  sample  is  so 
small. For a sample of n = 2 or n = 3, the estimated
density  would  be  identically  the  trial  distribution  with 
the  location  and  scale  determined  from  the  sample. 


2.  Smooth  density  estimates 

We  can  apply  the  procedure  to  a  sample  showing 
definite  skewness,  as  shown  in  figure  3.  Here  we  get 
an  estimate  which  has  some  skewness,  but  not  to  the 
extent  demonstrated  by  the  sample.  The  problem  is 
that  the  resulting  estimate  is  not  very  smooth  (at  the 
boundaries of the intervals). Also, our estimate will
always be strictly decreasing and hence cannot
adequately model multi-modal situations.


Figure  3:  Density  estimate  for  skewed  sample 


The data used in figure 3 are the measurements of
the ozone level at Stamford, Connecticut for 136 days,
which are used by Chambers, Cleveland, Kleiner &
Tukey (1983) in their example of the density trace.

The  density  estimates  introduced  in  the  last  two 
sections  are  only  piece-wise  smooth,  as  we  are  using 
pieces  of  different  members  of  the  family  of  trial  distri¬ 
butions  for  each  of  the  intervals.  To  obtain  a  smooth 
estimate,  let  us  extend  the  estimate  for  each  interval  to 
the  entire  real  line  and  average  the  estimates  of  the 
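$n - 2$ (in this case) intervals together:

$$\hat f(x) = \frac{1}{n-2} \sum_{j=1}^{n-2} \frac{1}{\sigma_j}\, g\!\left(\frac{x - \mu_j}{\sigma_j}\right), \qquad (9)$$

where $\mu_j$ and $\sigma_j$ are given by (5) as before. As long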
as  g(-)  is  a  proper  density  function,  we  will  not  need 
to  normalize  (9)  as  it  will  properly  integrate  to  one. 
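
The averaging step can be illustrated as an equal-weight mixture of the per-interval trial densities, each extended to the whole real line. The sketch below is an assumption-laden illustration rather than the author's code: it assumes a standard normal trial distribution, trial quantiles at probabilities i/n, and each component scaled by the reciprocal of its scale parameter so that it is a proper density. The function names are hypothetical, and the sample is the Davis data quoted later in the paper.

```python
from statistics import NormalDist

def interval_params(sample, trial=NormalDist()):
    """(mu, sigma) for each of the n-2 interior intervals."""
    x = sorted(sample)
    n = len(x)
    t = [(x[i] + x[i + 1]) / 2.0 for i in range(n - 1)]   # centering points
    lam = [trial.inv_cdf(i / n) for i in range(1, n)]      # assumed trial quantiles
    params = []
    for i in range(n - 2):
        sigma = (t[i + 1] - t[i]) / (lam[i + 1] - lam[i])
        params.append((t[i] - sigma * lam[i], sigma))
    return params

def smooth_density(xval, params, trial=NormalDist()):
    """Equal-weight average of the interval densities, each on the whole line."""
    return sum(trial.pdf((xval - mu) / s) / s for mu, s in params) / len(params)

davis = [758, 855, 905, 918, 919, 920, 929, 936, 948, 950, 972,
         1035, 1045, 1067, 1092, 1126, 1156, 1162, 1170, 1196]
params = interval_params(davis)

# Trapezoid rule over a range wide enough to hold every component's mass.
lo = min(mu - 10.0 * s for mu, s in params)
hi = max(mu + 10.0 * s for mu, s in params)
steps = 20000
h = (hi - lo) / steps
area = h * (sum(smooth_density(lo + k * h, params) for k in range(1, steps))
            + 0.5 * (smooth_density(lo, params) + smooth_density(hi, params)))
```

The numerical integral comes out close to one without any normalizing constant, which is the property claimed for the averaged estimate when the trial density is proper.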

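To further illustrate this procedure, let us consider
using the three-parameter Weibull distribution as the
trial distribution,

$$F(x) = 1 - \exp\{-\alpha (x - \nu)^{\beta}\}, \quad x > \nu, \; \alpha > 0, \; \beta > 0, \qquad (10)$$

where $G(z) = 1 - e^{-z}$. The density is then estimated by


In this case the value of $\nu$ must be determined from the
data or fixed arbitrarily. The remaining two
parameters, $\beta$ and $\alpha$, can then be determined for each
interval by


which gives

$$\beta_i = \frac{\log \lambda_{i+1} - \log \lambda_i}{\log(t_{i+1} - \nu) - \log(t_i - \nu)}, \qquad (13)$$

$$\alpha_i = \exp\left\{ \frac{\log \lambda_i \, \log(t_{i+1} - \nu) - \log \lambda_{i+1} \, \log(t_i - \nu)}{\log(t_{i+1} - \nu) - \log(t_i - \nu)} \right\}.$$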
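
The per-interval Weibull parameters can be checked with a short calculation. This sketch assumes the distribution function is one minus the exponential of minus alpha times (x minus nu) raised to the power beta, so that matching two centering points to two trial quantiles gives a pair of equations that are linear in log alpha and beta. The function name and the threshold value of -70 are hypothetical; the two centering points are midpoints taken from the Darwin data.

```python
import math

def weibull_interval_params(t_lo, t_hi, lam_lo, lam_hi, nu):
    """Solve alpha * (t - nu)**beta = lam at both ends of one interval."""
    a_lo = math.log(t_lo - nu)
    a_hi = math.log(t_hi - nu)
    beta = (math.log(lam_hi) - math.log(lam_lo)) / (a_hi - a_lo)
    log_alpha = (math.log(lam_lo) * a_hi - math.log(lam_hi) * a_lo) / (a_hi - a_lo)
    return math.exp(log_alpha), beta

def weibull_cdf(x, alpha, beta, nu):
    return 1.0 - math.exp(-alpha * (x - nu) ** beta)

n = 15
i = 3
# Quantiles of G(z) = 1 - exp(-z) at probabilities i/n and (i+1)/n.
lam_lo = -math.log(1.0 - i / n)
lam_hi = -math.log(1.0 - (i + 1) / n)
t_lo, t_hi = 7.0, 11.0   # midpoints (6+8)/2 and (8+14)/2 of the Darwin data
nu = -70.0               # hypothetical threshold below the smallest observation

alpha, beta = weibull_interval_params(t_lo, t_hi, lam_lo, lam_hi, nu)
p_lo = weibull_cdf(t_lo, alpha, beta, nu)
p_hi = weibull_cdf(t_hi, alpha, beta, nu)
```

With these parameters the fitted distribution places cumulative probability i/n at the lower centering point and (i+1)/n at the upper one, so the interval between them carries probability 1/n.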

The normal quantile data with Weibull trial distribution,
taking $\nu = -3$, together with the plots for normal
and logistic trial distributions, are shown in figure 4.
All three curves have a very similar “normal” shape,
differing only slightly in the amount of peakedness.


Figure  4:  Smooth  estimates  for  normal  quantiles 

The value of $\nu$, which must be less than the
smallest observation, can be arbitrarily assigned based
on the physical situation (e.g., $\nu = 0$ when data must
be non-negative). This gives a graph which behaves
somewhat  strangely  near  zero,  with  the  density  estimate 
suddenly  shooting  up  to  a  large  value.  This  might  be 
explained  in  that,  although  we  are  assuming  a  strictly 
continuous  and  positive  distribution  for  the  ozone  level, 
the  actual  distribution  may  have  a  positive  probability 
of  a  zero  level.  If  we  instead  assume  the  density 
begins  at  some  negative  value,  then  we  can  estimate  the 
probability  at  zero  to  be  the  area  under  the  curve  to 
the  left  of  x  =  0.  (The  same  argument  can  be  applied 
to  the  normal  and  logistic  estimates.) 

Thus, let us estimate the value of $\nu$ from the data
by selecting the value of $\nu$ which will result in the
maximum value for the estimated likelihood function.
By inspection we find that this “maximum likelihood
estimate” is obtained when $\nu = -143$, which gives the
curve shown in figure 5. Figure 5 also shows the
density estimates for normal and logistic trial distributions.
Note again that all three estimates are very
similar, the major differences being the peakedness of
the three local maxima.

Figure 5: Smooth density estimates for ozone data


Figure  6:  Smooth  density  estimates  for  Darwin  data 


We now consider two more “real data” examples.
The Darwin data of the heights of 15 plants, as given
in Box & Tiao (1973), is often modeled as a
Cauchy density:

-67, -48, 6, 8, 14, 16, 23, 24, 28, 39, 41, 49, 56, 60, 75.

Again, using normal, logistic and maximum likelihood
Weibull trial distributions we get density estimates as
shown in figure 6. Again, all three estimates are very
normal-shaped, except with very noticeable long tails as
is characteristic of the Cauchy distribution. (Note
however, this data does not necessarily arise from a
Cauchy; it is only often modeled by a Cauchy due to
the apparent “outliers” in the tails.)

Another  data  set,  due  to  Davis  (1952),  consists  of 
reliability  measurements  of  an  electronic  component: 

758,  855, 905, 918, 919, 920, 929, 936, 948, 950, 972, 

1035, 1045, 1067, 1092, 1126, 1156, 1162, 1170, 1196. 

Again  using  normal,  logistic  and  maximum 
likelihood  Weibull  trial  distributions  we  obtain  very 
similar  estimates,  as  shown  in  figure  7,  which  again 
differ  mainly  in  the  amount  of  peakedness  near  each  of 
the  local  maxima. 


Figure  7:  Smooth  density  estimates  for  Davis  data 


A COMPARISON OF TWO LARGE SAMPLE CONFIDENCE INTERVALS
FOR  A  PROPORTION:  A  MONTE  CARLO  SIMULATION 


Ken  Hung 

College  of  Business  and  Economics 
Western  Washington  University 
Bellingham,  WA  98225 


Abstract 


Two large sample confidence intervals for a proportion, as
given on page 394 of Larson (1982), are compared. It can be
shown through computer simulation experiments that,
for certain values of p, the confidence interval
obtained by the approximation is superior.

1. Introduction

Computers are to the study of statistics much as test tubes are to the study of chemistry. Many theoretical derivations in statistics can be investigated and confirmed by brute-force computing experiments. The purpose of this paper is to study empirically the two large sample confidence intervals for a proportion given in Larson (1982) and the issue raised in Alt and Walker (1981). Alt and Walker (1981) derived analytically that, for certain ranges of p, the approximated (1-α)100% confidence interval for a proportion is shorter than the unapproximated one. The basis for comparison in their paper is the expected value of the squared confidence interval length, while in this paper it is the expected value of the confidence interval length itself. This should be a more direct measure of interval length.

Section 2 discusses the theoretical and analytical aspects of the comparison of the two confidence intervals. The approach and method used in this study are presented in Section 3. The results are reported in Section 4. Section 5 concludes the paper.


2. Theory

Let $X_1, X_2, \ldots, X_n$ be a sequence of n independent Bernoulli random variables with parameter p as the probability of success on each trial. Then $X = \sum_{i=1}^{n} X_i = n\bar{x}$ is a Binomial random variable, where $\bar{x} = X/n = \hat{p}$. Given $E(X) = np$ and $\mathrm{Var}(X) = np(1-p)$, it follows from the Central Limit Theorem that

\[ \frac{X - np}{\sqrt{np(1-p)}} = \frac{\bar{x} - p}{\sqrt{p(1-p)/n}} \tag{2.1} \]

is distributed asymptotically as a standard normal random variable. Thus, we have

\[ P\left( -Z < \frac{\bar{x} - p}{\sqrt{p(1-p)/n}} < Z \right) = 1 - \alpha \tag{2.2} \]

which is equivalent to

\[ P\left( \frac{|\bar{x} - p|}{\sqrt{p(1-p)/n}} < Z \right) = 1 - \alpha \tag{2.3} \]

where $Z = Z_{\alpha/2}$. The inequality

\[ (\bar{x} - p)^2 < Z^2 p(1-p)/n \tag{2.4} \]

as a quadratic inequality in p has two real and unequal roots. These two roots are the desired confidence limits for p. Let $Q_1$ and $Q_2$ denote the lower and upper confidence limits respectively. Use of the quadratic formula yields

\[ Q_1 = \frac{\bar{x} + Z^2/2n - (Z/\sqrt{n})\sqrt{\bar{x}(1-\bar{x}) + Z^2/4n}}{1 + Z^2/n}, \qquad Q_2 = \frac{\bar{x} + Z^2/2n + (Z/\sqrt{n})\sqrt{\bar{x}(1-\bar{x}) + Z^2/4n}}{1 + Z^2/n}. \tag{2.5} \]

When n is large and for reasonable $(1 - \alpha)$, $Z^2/n$ should approach zero. Therefore, the approximated large sample confidence limits are

\[ L_1 = \bar{x} - (Z/\sqrt{n})\sqrt{\bar{x}(1-\bar{x})}, \qquad L_2 = \bar{x} + (Z/\sqrt{n})\sqrt{\bar{x}(1-\bar{x})}. \tag{2.6} \]
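As a minimal sketch (in Python, not the Fortran used in this study), the two pairs of limits can be computed as follows. The function names are illustrative assumptions, and the first function assumes that the limits obtained from (2.4) take the standard form produced by the quadratic formula.

```python
import math

def quadratic_limits(x_bar, n, z=1.96):
    """Limits Q1, Q2: roots of (x_bar - p)^2 = z^2 p (1 - p) / n, as in (2.4)."""
    centre = x_bar + z**2 / (2 * n)
    half = (z / math.sqrt(n)) * math.sqrt(x_bar * (1 - x_bar) + z**2 / (4 * n))
    denom = 1 + z**2 / n
    return (centre - half) / denom, (centre + half) / denom

def approximate_limits(x_bar, n, z=1.96):
    """Large-sample limits L1, L2, obtained by dropping the z^2/n terms."""
    half = (z / math.sqrt(n)) * math.sqrt(x_bar * (1 - x_bar))
    return x_bar - half, x_bar + half

q1, q2 = quadratic_limits(0.10, 50)
l1, l2 = approximate_limits(0.10, 50)
print("quadratic:  ", q1, q2, "length", q2 - q1)
print("approximate:", l1, l2, "length", l2 - l1)
```

Both functions take the observed proportion rather than the count, so they can be applied directly to each simulated sample.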

The above approximated confidence limits can also be derived from the asymptotically standard normal random variable

\[ \frac{\bar{x} - p}{\sqrt{\bar{x}(1-\bar{x})/n}} = \frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \tag{2.7} \]

since $\bar{x}(1-\bar{x})$ is an asymptotically unbiased estimator of $p(1-p)$, that is, $E[\bar{x}(1-\bar{x})] \approx p(1-p)$ for large n. See Alt and Walker (1981).
For the confidence interval in (2.5), we have

\[ (Q_2 - Q_1)^2 = \frac{4nZ^2\,\bar{x}(1-\bar{x})}{(n+Z^2)^2} + \frac{Z^4}{(n+Z^2)^2}. \tag{2.8} \]

For the confidence interval in (2.6), we have

\[ (L_2 - L_1)^2 = \bar{x}(1-\bar{x})\,\frac{4Z^2}{n}. \tag{2.9} \]

Let $Q = E(Q_2 - Q_1)^2$ and $L = E(L_2 - L_1)^2$. Solving the inequality $Q > L$ gives the ranges of p values that satisfy it. Since $E(\bar{x}) = p$ and $E(\bar{x}^2) = \mathrm{Var}(\bar{x}) + [E(\bar{x})]^2 = p(1-p)/n + p^2$, we can easily show that

\[ E[\bar{x}(1-\bar{x})] = \frac{p(1-p)(n-1)}{n}, \tag{2.10} \]

\[ E(Q_2 - Q_1)^2 = \frac{p(1-p)(n-1)}{n}\,\frac{4nZ^2}{(n+Z^2)^2} + \frac{Z^4}{(n+Z^2)^2} \tag{2.11} \]

and

\[ E(L_2 - L_1)^2 = \frac{p(1-p)(n-1)}{n}\,\frac{4Z^2}{n}. \tag{2.12} \]

Simplify

\[ \frac{p(1-p)(n-1)}{n}\,\frac{4nZ^2}{(n+Z^2)^2} + \frac{Z^4}{(n+Z^2)^2} > \frac{p(1-p)(n-1)}{n}\,\frac{4Z^2}{n} \tag{2.13} \]

to

\[ \frac{n}{4(1-1/n)(2n+Z^2)} > p(1-p). \tag{2.14} \]

When n is large and $Z^2$ is ignored, the left hand term of (2.14) can be approximated by 1/8. Thus, the quadratic inequality in p

\[ p^2 - p + \tfrac{1}{8} > 0 \tag{2.15} \]

can be solved, with the two roots p = .146 and p = .854. This means that for p < .146 and p > .854, the expected squared length of (2.5) is greater than that of (2.6).
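The crossover values can be checked numerically. The sketch below is illustrative only: it assumes the left-hand side of (2.14) has the form n / [4(1 - 1/n)(2n + Z^2)], which is what the derivation above reduces to, and the function name is hypothetical. It solves p(1 - p) = c for the exact finite-n threshold c and for the limiting value c = 1/8.

```python
import math

def crossover_points(n=None, z=1.96):
    """Roots of p(1 - p) = c; c = 1/8 in the large-n limit (n=None)."""
    if n is None:
        c = 1 / 8
    else:
        c = n / (4 * (1 - 1 / n) * (2 * n + z**2))
    disc = math.sqrt(1 - 4 * c)
    return (1 - disc) / 2, (1 + disc) / 2

print("limit:", crossover_points())
for n in (50, 100, 150, 200, 250):
    print(n, crossover_points(n))
```

For finite n the roots lie slightly inside the limiting pair, so the range of p in which the expected squared length of (2.5) exceeds that of (2.6) is a little narrower than the large-n approximation suggests.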


3.  Method 

A Fortran program was written to call the IMSL subroutine GGBN for generating binomial random numbers and to compute the average lengths for the two pairs of confidence limits in (2.5) and (2.6). The Z value is set to 1.96 so as to give 95% confidence intervals for both pairs. The number of trials n in each experiment is increased from 50 to 250 in steps of 50, at different proportion values in the (0,1) range. For each fixed number of trials (n = 50, 100, 150, 200, 250) and each proportion value (p = .05, .10, ..., .95), one hundred experiments are performed, generating 100 binomial random numbers ($X_j$ for j = 1, ..., 100). For instance, $X_1 = 2$, $X_2 = 3$, $X_3 = 3$, ..., $X_{100} = 4$ for, say, n = 50 at p = .05. The computational scheme is outlined below:

\[ \bar{x}_j = X_j/n = \hat{p}_j \quad \text{for } j = 1, 2, \ldots, 100 \tag{3.1} \]

\[ Q_j = Q_{2j} - Q_{1j}, \qquad E(Q) = \sum_j Q_j / 100 \tag{3.2} \]

\[ L_j = L_{2j} - L_{1j}, \qquad E(L) = \sum_j L_j / 100 \tag{3.3} \]

Hence, two average confidence interval lengths are computed for comparison. Meanwhile, the lower limits are checked for negative values, as it is meaningless to have negative proportion values.
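The scheme can be sketched in a few lines of Python. This is not the original Fortran/IMSL program: the standard-library generator stands in for GGBN, the seed and function name are arbitrary assumptions, and the simulated averages will therefore differ from the tabulated values in the later decimals.

```python
import math
import random

def simulate_mean_lengths(n, p, reps=100, z=1.96, seed=1):
    """Return (E(Q), E(L), percentage of negative lower limits L1)."""
    rng = random.Random(seed)
    total_q = total_l = 0.0
    negative = 0
    for _ in range(reps):
        x = sum(rng.random() < p for _ in range(n))  # one Binomial(n, p) draw
        x_bar = x / n
        var_term = x_bar * (1 - x_bar)
        half_l = (z / math.sqrt(n)) * math.sqrt(var_term)
        half_q = (z / math.sqrt(n)) * math.sqrt(var_term + z**2 / (4 * n)) / (1 + z**2 / n)
        total_q += 2 * half_q          # length Q2 - Q1
        total_l += 2 * half_l          # length L2 - L1
        negative += (x_bar - half_l) < 0
    return total_q / reps, total_l / reps, 100.0 * negative / reps

for n in (50, 100, 150, 200, 250):
    e_q, e_l, pct = simulate_mean_lengths(n, 0.05)
    print(n, round(e_q, 6), round(e_l, 6), pct)
```

Looping the same call over p = .05, .10, ..., .95 gives a grid of averages with the same layout as the tables in the next section.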


4.  Results 

The results are reported below in tabular form. All numbers are significantly different from zero at α < .0000001.


Table 1. Confidence Lengths for p = .05 to .95

               n = 50                n = 100
   p       E(Q)      E(L)        E(Q)      E(L)
  .05    .129240   .111604     .086883   .081212
  .10    .160865   .153494     .118447   .116699
  .15    .196048   .196297     .138551   .138601
  .20    .216052   .219484     .155206   .156499
  .25    .228749   .233905     .167382   .169501
  .30    .245147   .252505     .175688   .178338
  .35    .251126   .259244     .182566   .185612
  .40    .259527   .268688     .187935   .191335
  .45    .262571   .272100     .190823   .194393
  .50    .264110   .273823     .191229   .194823
  .55    .263677   .273339     .190303   .193842
  .60    .258847   .267923     .187152   .190505
  .65    .252268   .260526     .182632   .185712
  .70    .242492   .249517     .176238   .178923
  .75    .228482   .233639     .164684   .166623
  .80    .213716   .216753     .153372   .154526
  .85    .191278   .190726     .138976   .139064
  .90    .162923   .156850     .119114   .117444
  .95    .127037   .107727     .090024   .094820

               n = 150               n = 200
   p       E(Q)      E(L)        E(Q)      E(L)
  .05    .073089   .070345     .059756   .057695
  .10    .096291   .095323     .083232   .082603
  .15    .113671   .113716     .099044   .099094
  .20    .127443   .128165     .110006   .110455
  .25    .135893   .136994     .118682   .119423
  .30    .144120   .145572     .125345   .126298
  .35    .150863   .152591     .130473   .131584
  .40    .154345   .156211     .133971   .135186
  .45    .156761   .158723     .136266   .137549
  .50    .157607   .159602     .137054   .138360
  .55    .156862   .158828     .136222   .137503
  .60    .154620   .156498     .134362   .135589
  .65    .150248   .151951     .130681   .131798
  .70    .144797   .146278     .125889   .126860
  .75    .136504   .137632     .118849   .119596
  .80    .125992   .126643     .110431   .110897
  .85    .114032   .114097     .098272   .098292
  .90    .094655   .093598     .083051   .082419
  .95    .072323   .069487     .062522   .060699

               n = 250
   p       E(Q)      E(L)
  .05    .053906   .052499
  .10    .074071   .073608
  .15    .088089   .088108
  .20    .097574   .097872
  .25    .107009   .107560
  .30    .112697   .113392
  .35    .117070   .117871
  .40    .120240   .121117
  .45    .122119   .123040
  .50    .122779   .123715
  .55    .122190   .123112
  .60    .120358   .121237
  .65    .117629   .118444


Table 2. Confidence Lengths for p = .141 to .150

               n = 50                n = 100
   p       E(Q)      E(L)        E(Q)      E(L)
  .141   .188695   .187642     .134747   .134477
  .142   .185552   .183980     .133452   .133074
  .143   .192676   .192358     .135125   .134881
  .144   .190294   .189785     .137330   .137280
  .145   .190473   .189798     .137713   .137706
  .146   .196234   .196575     .138436   .138469
  .147   .184484   .182840     .136524   .136401
  .148   .193216   .193015     .137517   .137468
  .149   .191201   .190609     .138854   .138921
  .150   .197258   .197718     .139562   .139674

               n = 150               n = 200
   p       E(Q)      E(L)        E(Q)      E(L)
  .141   .111916   .111867     .097674   .097671
  .142   .110867   .110761     .096849   .096812
  .143   .109992   .109845     .097173   .097148
  .144   .111087   .110995     .096666   .096622
  .145   .110784   .110675     .098194   .098211
  .146   .111497   .111430     .097714   .097709
  .147   .111510   .111437     .096706   .096664
  .148   .112772   .112768     .098053   .098065
  .149   .113501   .113536     .097940   .097947
  .150   .113884   .113940     .098267   .098286

               n = 250
   p       E(Q)      E(L)
  .141   .086157   .086115
  .142   .086490   .086459
  .143   .086628   .086600
  .144   .086604   .086575
  .145   .086900   .086881
  .146   .087078   .087064
  .147   .086871   .086852
  .148   .087173   .087163
  .149   .088275   .088299
  .150   .088051   .088069


Table 3. Confidence Lengths for p = .848 to .857

               n = 50                n = 100
   p       E(Q)      E(L)        E(Q)      E(L)
  .848   .194058   .194033     .138874   .138952
  .849   .195307   .195373     .137464   .137415
  .850   .191125   .190103     .139139   .139223
  .851   .195690   .195666     .135951   .134787
  .852   .195164   .195387     .136982   .136902
  .853   .193889   .193788     .136169   .136023
  .854   .191018   .190548     .137492   .137440
  .855   .188169   .187087     .134055   .133727
  .856   .189168   .188288     .136403   .136265
  .857   .188540   .187527     .136354   .136236

               n = 150               n = 200
   p       E(Q)      E(L)        E(Q)      E(L)
  .848   .114195   .114264     .099455   .099521
  .849   .113754   .113802     .099357   .099421
  .850   .112745   .112745     .098589   .098622
  .851   .113870   .113926     .098091   .098104
  .852   .113628   .113668     .098187   .098204
  .853   .111333   .111254     .098315   .098336
  .854   .111889   .111840     .097823   .097828
  .855   .111643   .111579     .097202   .097178
  .856   .110922   .110815     .097405   .097390
  .857   .111827   .111776     .096491   .096441

               n = 250
   p       E(Q)      E(L)
  .848   .089044   .089091
  .849   .088881   .088923
  .850   .088370   .088397
  .851   .088361   .088387
  .852   .087940   .087954
  .853   .088464   .088494
  .854   .086444   .096412
  .855   .087600   .087604
  .856   .085962   .085912
  .857   .085957   .085909

Table 4. Percentage of L1 < 0

   p     n = 50   n = 100   n = 150   n = 200   n = 250
  .05      68        25         3         2         0
  .10      30         0         0         0         0
  .11      26         1         0         0         0
  .12      16         0         0         0         0
  .13      13         1         0         0         0
  .14       3         0         0         0         0
  .15       6         0         0         0         0
  .16       6         0         0         0         0
  .17       0         0         0         0         0
  .18       1         0         0         0         0
  .19       0         0         0         0         0
  .99       0         0         0         0         0
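The simulated percentages of negative lower limits can be compared with their exact binomial probabilities. The sketch below is an illustration rather than part of the original study; it assumes the approximate lower limit has the form L1 = x̄ - (Z/√n)√(x̄(1 - x̄)), so that L1 < 0 exactly when 0 < X < nZ²/(n + Z²). The function name is hypothetical.

```python
from math import comb

def prob_negative_lower_limit(n, p, z=1.96):
    """Exact P(L1 < 0) for X ~ Binomial(n, p)."""
    cutoff = n * z**2 / (n + z**2)   # L1 < 0 iff 0 < X < cutoff
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(1, n + 1) if k < cutoff)

for n in (50, 100, 150, 200, 250):
    print(n, round(100 * prob_negative_lower_limit(n, 0.05), 1))
```

For p = .05 this gives roughly 68% at n = 50 and 25% at n = 100, in line with the simulated entries for those two cells.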

5.  Conclusion 

It can be inferred from the data presented in the results that E(L) > E(Q) in general. However, in the ranges of the proportion values given in Table 5 below, E(Q) is greater than E(L).

Table 5. Ranges of p for which E(Q) > E(L)

    n      p <     p >
    50    .146    .852
   100    .146    .850
   150    .149    .852
   200    .145    .854
   250    .149    .855

This is quite consistent with the conclusion in Alt and Walker (1981). The difference between the two papers is that this paper uses the expected length directly, while that paper uses the expected squared length.

The problem of a negative lower confidence limit is most pronounced when the proportion value is low and the number of trials n is as small as 50. Once the number of trials n reaches 250, the problem disappears entirely in these experiments. See Table 4.


Abstract

Suppose that the evolution of a character possessed by a number of current species is modelled as a Markov random field on an evolutionary tree. Suppose that for each pair of current species we know the joint probability distribution of the pair of characters possessed by that pair of species. We give conditions under which the evolutionary tree can be reconstructed from knowledge of these pairwise joint distributions, that is, conditions under which there is only one evolutionary tree topology consistent with the given pairwise distributions. In this way we establish consistency of a method for reconstructing evolutionary trees using pairwise distributions estimated from observed homologous DNA sequences.

1  Introduction 

Evolutionary relationships among species are commonly conceptualized in terms of an “evolutionary tree.” A tree consists of nodes and arcs. The degree of a node is the number of arcs incident to the node. Nodes of degree one are terminal nodes, and nodes of higher degree are internal nodes. In an evolutionary tree, the terminal nodes are labelled by current species observable today, and the internal nodes correspond to ancestral species. We assume speciation events occur at internal nodes, so that we do not allow nodes of degree two. The scientific problem of interest to us is to infer the evolutionary tree relating a given set of current species. This inference is to be based on data, which might typically be a set of observed DNA sequences, one from each of the given current species. For most of the paper, we will restrict our attention to the topology of the tree together with the labels of the terminal nodes. In particular, we are not interested in length of time, direction of time, or the root of the tree.

Let $T$ denote a finite set of current species, let $C$ denote a finite set of characters, and for each $t \in T$ let $X_t$ denote the character possessed by species $t$. For example, $C$ might be the set of four nucleotides, and $X_t$ might identify the nucleotide occupying a particular site in the DNA of a representative of species $t$. We consider $\{X_t : t \in T\}$ to be random variables generated by a Markov random field model on an evolutionary tree. To describe the model, we begin with the tree $\mathcal{T} = (S, A)$, characterized by its set of nodes (or species) $S$ and its set of arcs $A$. $S$ may be decomposed into the union $S = T \cup N$ of the set $T$ of terminal nodes and the set $N$ of nonterminal nodes; since current species correspond to terminal nodes, there is no conflict with the notation $T$ introduced above. Each arc $a \in A$ is undirected and may be represented as a subset $\{r, s\}$ containing two distinct nodes $r, s \in S$. For each $s \in S$ let $X_s$ be a random variable taking values in $C$. We assume that $\{X_s : s \in S\}$ is a Markov random field on $\mathcal{T}$, which means that for each $s \in S$ the conditional distribution of $X_s$ given all of the other values $\{X_r : r \neq s\}$ is the same as the conditional distribution of $X_s$ given just the values $\{X_r : \{r, s\} \in A\}$ at the “neighbors” of $s$. This completes the description of the probabilistic model for the evolution of a single character. In general we observe $n$ characters for each species. In this case, we make the standard but undoubtedly unrealistic assumption that distinct characters are independent and identically distributed (iid), that is, we imagine that $X^1, \ldots, X^n$ are iid, where each $X^i = \{X^i_s : s \in S\}$ is a Markov random field on $\mathcal{T}$.

The following brief remarks about methods of evolutionary tree reconstruction are intended to provide some context for this work; the excellent survey of Felsenstein (1988) should be consulted for more background and references. The most popular method is probably the parsimony method of Camin and Sokal (1965), which chooses a tree in which characters can be assigned to nodes so that the number of changes of character across arcs of the tree is minimal over all trees. This method has the very considerable virtue of ease of implementation. However, the unfortunate truth observed by Felsenstein (1978) is that parsimony is inconsistent; in fact, Felsenstein exhibited an example in which the probability that parsimony would choose an incorrect tree approached one as the number $n$ of observed characters per species approached infinity. A maximum likelihood method for the present model was considered by Barry and Hartigan (1987b). This method overcomes parsimony’s defect of inconsistency, at the cost of a great increase in computational difficulty. The distance method of Barry and Hartigan (1987a), which will be described more fully below, is intermediate between parsimony and maximum likelihood in terms of computational difficulty. The question that originated the present investigations was whether the distance method is consistent. This question will be addressed in section 3.


2  Identifiability  of  the  Tree 

A principal ingredient in the consistency proof is an “identifiability” result that says that under the assumptions of our Markov model and certain other conditions, distributions of pairs of the form $(X_t, X_u)$, where $t$ and $u$ are terminal nodes, determine the evolutionary tree. This result in turn follows from Lemma 1 below, which says that knowing the values of an “additive function” on pairs of terminal nodes of a tree is enough to determine the tree.

The statement of Lemma 1 requires some definitions. Let $\mathcal{T}_1 = (S_1, A_1)$ and $\mathcal{T}_2 = (S_2, A_2)$ be two trees with $S_1 = T \cup N_1$ and $S_2 = T \cup N_2$, so that the terminal nodes of $\mathcal{T}_1$ and $\mathcal{T}_2$ are the same. We say that $\mathcal{T}_1$ and $\mathcal{T}_2$ are equivalent if there is a bijective “relabelling” function $p : S_1 \to S_2$ such that $p(t) = t$ for all $t \in T$ and $A_2 = \{\{p(r), p(s)\} : \{r, s\} \in A_1\}$. In this case we write $\mathcal{T}_1 \sim \mathcal{T}_2$. Thus, $\mathcal{T}_1 \sim \mathcal{T}_2$ means that $\mathcal{T}_1$ and $\mathcal{T}_2$ are the same up to a possible relabelling of nonterminal nodes.

Next, for a given tree $\mathcal{T} = (S, A)$, let $\vec{A} = \{(r, s) : \{r, s\} \in A\}$ be the set of directed arcs of $\mathcal{T}$. Then for all distinct $r, s \in S$ either $\pi(r, s) := \{(r, s)\} \subset \vec{A}$ or there is a unique $n \geq 1$ and a unique sequence $s_1, \ldots, s_n$ of distinct nodes such that $\pi(r, s) := \{(r, s_1), (s_1, s_2), \ldots, (s_n, s)\} \subset \vec{A}$. This defines $\pi(r, s)$, the path from $r$ to $s$. We say that a function $f : S \times S \to \mathbb{R}$ is additive on the tree $\mathcal{T}$ if for all $r, s \in S$ we have

\[ f(r, s) = \sum_{a \in \pi(r, s)} f(a), \]

with the sum being defined to be 0 if $r = s$.

Lemma 1 Let $\mathcal{T}_1 = (S_1, A_1)$ and $\mathcal{T}_2 = (S_2, A_2)$ be two evolutionary trees with the same set of terminal nodes $T$. Suppose there exist functions $f_1 : S_1 \times S_1 \to \mathbb{R}$ and $f_2 : S_2 \times S_2 \to \mathbb{R}$ such that

1. $f_i(r, s) + f_i(s, r) > 0$ for all $\{r, s\} \in A_i$ and $i = 1, 2$

2. $f_i$ is additive on $\mathcal{T}_i$ for $i = 1, 2$

3. $f_1(t, u) = f_2(t, u)$ for all $t, u \in T$.

Then $\mathcal{T}_1 \sim \mathcal{T}_2$.

Results  appearing  in  the  papers  of  Dobson 
(1974)  and  Sattath  and  Tversky  (1977)  are 
clearly  closely  allied  but  apparently  not  the  same. 
They  focus  on  existence  of  trees  satisfying  certain 
conditions,  while  we  are  interested  in  uniqueness. 
We  also  find  the  fact  that  our  additive  function 
need  not  be  nonnegative  to  be  interesting. 

The key to the identifiability result mentioned above is the notion of distance introduced by Barry and Hartigan (1987a), which takes the form

\[ d(r, s) = -(1/4) \log[\det(P^{rs})] \]

for four-valued characters, where $P^{rs}$ is the Markov transition matrix whose $(i, j)$th entry is $P(X_s = j \mid X_r = i)$. For the Markov model we have assumed, $d$ is an additive function. The identifiability result is then just the statement obtained by taking the additive function in Lemma 1 to be Barry and Hartigan’s distance. The conditions required are

\[ \det(P^{rs}) > 0 \quad \text{for } \{r, s\} \in A \tag{2.1} \]

and

\[ \det(P^{rs}) \det(P^{sr}) < 1 \quad \text{for } \{r, s\} \in A. \tag{2.2} \]

Under conditions (2.1) and (2.2), the pairwise distributions of characters at terminal nodes determine the evolutionary tree. Condition (2.2) corresponds to condition 1 of the lemma. To get an idea of its significance, note that an example of a situation it rules out is $P^{rs} = P^{sr} = I$ for two internal nodes $r$ and $s$ joined by an arc. This is reasonable, since in such a case we could eliminate the arc $\{r, s\}$ and combine nodes $r$ and $s$ into one node without changing any probability distributions at the terminal nodes of the tree. Condition (2.1) ensures that the logarithm in Barry and Hartigan’s distance is defined. This would presumably hold in biologically realistic models, for example, models in which characters evolve as a Markov chain in continuous time. In any case, both conditions (2.1) and (2.2) may be relaxed by the device of using the distance $d'(r, s) = -(1/4) \log|\det(P^{rs})|$ in place of $d$. The resulting conditions would be $\det(P^{rs}) \neq 0$ and $|\det(P^{rs}) \det(P^{sr})| < 1$ for $\{r, s\} \in A$.
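The additivity of this distance along a path can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the two transition matrices are hypothetical symmetric four-state matrices chosen for convenience, and the helper names are not from the paper. It relies on the Markov property, under which the transition matrix from r to t is the product of those from r to s and from s to t when s lies between them, so that the determinants multiply and the distances add.

```python
import math

def det(m):
    """Determinant by cofactor expansion (adequate for 4 x 4 matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def distance(p):
    """d = -(1/4) log det(P) for a 4 x 4 transition matrix P."""
    return -0.25 * math.log(det(p))

def symmetric_transition(change):
    """Hypothetical matrix: probability `change` of moving to each other state."""
    return [[1 - 3 * change if i == j else change for j in range(4)]
            for i in range(4)]

p_rs = symmetric_transition(0.05)
p_st = symmetric_transition(0.10)
p_rt = matmul(p_rs, p_st)            # s separates r from t
print(distance(p_rt), distance(p_rs) + distance(p_st))
```

The two printed values agree up to rounding error, which is the additivity property used in Lemma 1.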

3  Consistency 

The method Barry and Hartigan (1987a) propose for choosing an evolutionary tree from given data applies the least squares idea of Cavalli-Sforza and Edwards (1967) in the following manner. For each ordered pair $(t, u)$ of terminal nodes, form the estimated distances

\[ \hat{d}(t, u) = -(1/4) \log[\det(\hat{P}^{tu})], \tag{3.1} \]

where $\hat{P}^{tu}$ is the usual empirical estimate of $P^{tu}$. For a candidate tree $\mathcal{T} = (S, A)$ under consideration, for each $(r, s) \in \vec{A}$ introduce a variable $x_{rs}$. Define the “departure from additivity” of the tree $\mathcal{T}$ to be the minimum of the quantity

\[ \sum_{t, u \in T} \Big( \hat{d}(t, u) - \sum_{(r, s) \in \pi(t, u)} x_{rs} \Big)^2 \]

over all possible values of the variables $x_{rs}$. Choose the tree having the smallest departure from additivity.
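A minimal sketch of the departure-from-additivity computation for four terminal nodes follows. It is simplified relative to the method described above: it assumes symmetric distances with one unknown length per undirected arc (rather than one variable per directed arc), solves the least squares problem through the normal equations, and uses made-up distances that are exactly additive on the topology ab|cd. All names are illustrative.

```python
from itertools import combinations

TERMINALS = "abcd"
PAIRS = list(combinations(TERMINALS, 2))

def design_matrix(split):
    """Rows: terminal pairs; columns: 4 pendant arcs plus 1 internal arc."""
    rows = []
    for t, u in PAIRS:
        row = [0.0] * 5
        row[TERMINALS.index(t)] = 1.0
        row[TERMINALS.index(u)] = 1.0
        if (t in split) != (u in split):   # path crosses the internal arc
            row[4] = 1.0
        rows.append(row)
    return rows

def solve(a, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def departure(split, dist):
    """Minimum residual sum of squares for the topology split | rest."""
    rows = design_matrix(split)
    y = [dist[pair] for pair in PAIRS]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(5)] for i in range(5)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(5)]
    beta = solve(xtx, xty)
    return sum((v - sum(c * w for c, w in zip(r, beta))) ** 2
               for r, v in zip(rows, y))

# Hypothetical distances: pendant lengths .1, .2, .3, .4 and internal length .5.
dist = {("a", "b"): 0.3, ("a", "c"): 0.9, ("a", "d"): 1.0,
        ("b", "c"): 1.0, ("b", "d"): 1.1, ("c", "d"): 0.7}
for split in ("ab", "ac", "ad"):
    print(split, departure(split, dist))
```

With these inputs only the topology ab|cd fits without residual, so it would be the tree chosen.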

To state the consistency result, let $\mathcal{T}$ denote the true evolutionary tree, and as usual assume that $X^1, X^2, \ldots$ are iid Markov random fields on $\mathcal{T}$. Suppose that $\hat{\mathcal{T}}^n$ denotes the estimate given by a tree reconstruction method when applied to the data $X^1, \ldots, X^n$. We say that the method is strongly consistent if with probability one there is a finite $N$ such that $\hat{\mathcal{T}}^n = \mathcal{T}$ for all $n \geq N$.

Theorem 2 Suppose the true evolutionary tree $\mathcal{T} = (S, A)$ is bifurcating, that is, all internal nodes have degree 3. Then under conditions (2.1) and (2.2), the method of Barry and Hartigan (1987a) is strongly consistent.

The assumption that the true tree is bifurcating rules out nodes of degree higher than 3. This restriction is necessary for the following reason. Lemma 1 states that different trees cannot have exactly the same distances between pairs of terminal nodes. However, if the true tree has nodes of degree higher than 3, then there are different trees that may have distances arbitrarily close to the true distances. If the true tree is bifurcating and the true model satisfies the conditions (2.1) and (2.2), then there are no such different trees.

4 No Information in Asymmetry

The distance $d$ is asymmetric in general: $d(r, s) \neq d(s, r)$. Since the symmetrized distance function

\[ \bar{d}(r, s) = [d(r, s) + d(s, r)]/2 \]

is also additive, one could work with $\bar{d}$ rather than $d$, and effectively cut in half the number of equations and unknowns in each least squares calculation. However, replacing $d$ by $\bar{d}$ involves ignoring some of the information in the data, so that one might suspect that we would pay for the gain in computational simplicity by sacrificing efficiency. It turns out that as far as the method of Barry and Hartigan is concerned, no efficiency at all is lost by symmetrizing. The reason is contained in the following result.

Proposition 3 Let a tree $\mathcal{T}$ having terminal nodes $T$ be given. Suppose we are also given the estimated distances $\hat{d}(t, u)$ of (3.1) for all $t, u \in T$. For additive functions $f$ on $\mathcal{T}$ define $S(f)$ and $\bar{S}(f)$ to be

\[ \sum_{t, u \in T} [f(t, u) - \hat{d}(t, u)]^2 \]

and

\[ \sum_{t, u \in T} \left[ \frac{f(t, u) + f(u, t)}{2} - \frac{\hat{d}(t, u) + \hat{d}(u, t)}{2} \right]^2, \]

respectively. Then we have

\[ \inf S(f) = \inf \bar{S}(f), \]

where the infima are taken over all functions $f$ additive on $\mathcal{T}$.


5 Identifiability of the Full Model

Although section 2 showed that pairwise distributions over terminal nodes determine a tree, it is interesting that such pairwise distributions do not determine the full model, that is, they do not determine the Markov transition matrices $P^{rs}$ for $\{r, s\} \in A$. This can be seen in the smallest nontrivial case: a tree having 3 terminal nodes $T = \{a, b, c\}$ and one nonterminal node $N = \{m\}$, say. Begin with an “original” model having marginal probability vector $\pi^m$ at node $m$ and Markov transition matrices $P^{ma}$, $P^{mb}$, and $P^{mc}$. These specifications determine the complete joint distribution of the Markov random field $\{X_a, X_b, X_c, X_m\}$, and in particular the Markov transition matrices $P^{am}$, $P^{bm}$, and $P^{cm}$. Let $\mathbf{1}$ denote a vector of ones and let a prime denote transpose. Then it turns out that if $R$ is an invertible matrix satisfying the conditions

1. $R\mathbf{1} = \mathbf{1}$


Proposition 4 Let $\mathcal{T}_1 = (S_1, A_1)$ and $\mathcal{T}_2 = (S_2, A_2)$ be two evolutionary trees with the same set of terminal nodes $T$. For $i = 1$ and 2, let $X_i = \{X_i(s) : s \in S_i\}$ be a Markov random field on $\mathcal{T}_i$ taking on two values. Suppose that conditions (2.1) and (2.2) hold with $P^{rs}$, $P^{sr}$, and $A$ replaced by $P_i^{rs}$, $P_i^{sr}$, and $A_i$, respectively, for $i = 1, 2$. Suppose also that the joint distributions of $\{X_1(t) : t \in T\}$ and $\{X_2(t) : t \in T\}$ are the same. Then $\mathcal{T}_1 \sim \mathcal{T}_2$, and we have equality of the full joint distributions of $\{X_1(s) : s \in S_1\}$ and $\{X_2(p(s)) : s \in S_1\}$, where $p$ is the “relabelling” function in the definition of the equivalence $\mathcal{T}_1 \sim \mathcal{T}_2$.

We conjecture that a similar result holds for the case of more than two characters.


Abstract

This paper briefly examines current methodology for developing genetic linkage maps and using them to find loci for quantitative traits (QTL). Maximum likelihood interval mapping is viewed as an extension of classical least squares methods when the trait of interest is normally distributed and located near a genetic marker. Some problems in finding multiple loci for a quantitative trait are examined for days to budding as measured on F2 plants from a Brassica rapa cross. Relevant design aspects of molecular biology experiments are briefly noted.

Introduction 

Last year colleagues in the plant sciences approached me with questions about recently developed programs for locating quantitative traits on genetic linkage maps (Lander and Botstein, 1989). They wanted to know how this related to “classical” approaches, and whether this new maximum likelihood approach was more appropriate.

The present study concerns the cross of two varieties of Brassica rapa, a Michihili Chinese cabbage (M) female with a Spring broccoli (S) male plant, producing a single F1 offspring which was self-pollinated. The resultant F2 seeds were germinated, yielding 95 plants which were measured for various phenotypic traits (observable characteristics such as days to first flower, days to budding, etc.). The F2s were assayed by 297 restriction fragment length polymorphisms, or RFLPs, which were drawn from previous studies of Brassica, DNA from both grandparents or the F1 parent, or selfed progeny of same.

A genetic linkage map was constructed (Song et al., 1990) which locates markers relative to one another based on the frequency of genetic recombination (crossover of chromosome pairs during meiosis) on the ten chromosome pairs in this genome. They measured the association between the RFLP patterns for each marker and the phenotypic traits and used the genetic linkage map to find probable sites, or loci, of genes controlling those traits (Song, Slocum and Osborn, 1991).

•Research supported by USDA-CSRS grant 511-100. Conference accommodations supported in part by Interface Foundation of North America. Special thanks to Keming Song and Mary Slocum for providing data.

The purpose of this paper is to examine the statistical properties of such quantitative trait loci (QTL). In particular we show the connection between the classical regression model at markers and the maximum likelihood interval mapping method presented in Lander and Botstein (1989) and discuss some inferential questions concerning confidence intervals and finding multiple loci (major and minor genes) which control days to budding.

RFLPs  and  Linkage  Maps 

Chromosomes come in pairs, and offspring inherit one of a pair from each parent. Any locus on a chromosome pair has two “alleles,” or forms of DNA, one from each parent. The F1 was “heterozygous” (had two different alleles) at each marker locus and presumably at all QTL. F2 plants inherit (via F1) both alleles at a locus from one grandparent (MM or SS) or one from each (MS). Genetic recombination can lead to different allele types along the same chromosome, which is exploited to generate RFLP linkage maps (Lander and Botstein, 1989).

RFLP involves digesting DNA with an enzyme and using discrepant fragment lengths as markers for genetic differences among individuals. The enzyme cuts DNA adjacent to a specific base pair pattern, say ACGTAT. A change (mutation or recombination) in this restriction site for one variety (say ACTTAT) would be missed by the enzyme, resulting in one long fragment rather than two shorter ones: a polymorphism. Other forms of DNA rearrangement between restriction sites (e.g. insertion/deletion/transposition) can also create polymorphisms. DNA fragments are separated by size on a Southern blot and “probed” by $^{32}$P-labelled DNA pieces which bond to homologous DNA fragments. Ideal genetic markers are probes which highlight exactly one RFLP, scoring F2s at this marker as MM, SS (parent types) or MS (hybrid, having both length fragments). However, RFLP patterns may be difficult to align (Branscomb, 1991) or distinguish (Figure 1).

Nearly  adjacent  genetic  markers  should  largely 
agree  in  allele  type  across  the  F2s,  with  differences 
probably due to recombination between the markers.
Distance is roughly proportional to the frequency
of recombination; 1% ≈ 1 centi-Morgan (cM)

10®  —  10®  DNA  base  pairs  (may  vary  along  a  chro¬ 
mosome  and  between  species).  Genetic  linkage  maps 
are  currently  constructed  by  examining  pairs  of  mark¬ 
ers,  then  triplets,  then  piecing  together  whole  chro¬ 
mosomes  (Lander  and  Green,  1987;  Song  et  al.  1991). 

Lander’s  MAPMAKER  program  provides  a  user- 
friendly  environment  for  this  empirical  maximization 
of  the  joint  likelihood  of  all  marker  loci  across  the 
genome.  Interesting  questions  remain  about  further 
optimizing  the  search  algorithm  and  ascertaining  that 
it  converges  to  a  unique  global  maximum. 

QTL,  LOD  and  MLE 

Genetic linkage maps are used to find quantitative
trait loci, or QTL. The mean for a single locus quantitative
trait $y$ depends on the allele type, i.e.,

$E(y) = \mu + a x + d(1 - |x|), \quad V(y) = \sigma^2,$

with $x = 1, -1, 0$ if the locus is of grandparent type
MM, SS, or hybrid (MS), respectively. Here $\mu$ is the
reference mean, $a$ the additive allelic effect and $d$ the
dominance effect of allele type M. Under the null hypothesis
of no QTL ($a = d = 0$), the F-statistic

$F = [\sum (\hat{y} - \bar{y})^2 / \nu_1] / \hat{\sigma}^2$ is distributed as $F_{\nu_1, \nu_2}$,

with $\bar{y}$ the sample mean, $\hat{y}$ the least squares estimate,
$\hat{\sigma}^2$ the variance estimate and degrees of freedom $\nu_1 = 2$
and $\nu_2 = n - 3$ for $n$ F2s scored at this locus. For
normal $y$, this is equivalent to the likelihood ratio
statistic, typically presented in human genetics as

$\mathrm{LOD} = \log_{10}(\text{likelihood ratio})$

$= [0.5 \sum (\hat{y} - \bar{y})^2 / \hat{\sigma}^2] / \log(10)$

$= \nu_1 F / 2\log(10).$

For  normally  distributed  traits,  these  two  ap¬ 
proaches  are  equivalent  and  exact.  Transformations 
toward  normality  are  used  in  practice.  Qualitative 


traits  (counts  and  +/— )  should  use  the  “deviance” 
(McCullagh  and  Nelder,  1983)  instead  of  the  sum  of 
squares.  In  some  cases,  this  reduces  to  a  test  on 
two-way  frequency  tables  at  each  marker  locus. 
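
The single-marker F-statistic and LOD defined above can be sketched in a few lines. This is only an illustration, not the code used for the analysis: the function name `marker_f_and_lod` and the toy data are hypothetical, and the sketch assumes all three genotype classes (coded 1, -1, 0) are present, in which case the least squares fit of the additive-plus-dominance model equals the three genotype class means.

```python
import math
from collections import defaultdict


def marker_f_and_lod(y, x):
    """Return (F, LOD) at one marker.

    y: trait values; x: genotype codes, 1 (MM), -1 (SS) or 0 (MS).
    Assumes all three genotype classes occur, so the fitted values of
    mu + a*x + d*(1 - |x|) are simply the class means.
    """
    groups = defaultdict(list)
    for yi, xi in zip(y, x):
        groups[xi].append(yi)
    if len(groups) != 3:
        raise ValueError("need all three genotype classes")

    n = len(y)
    ybar = sum(y) / n
    means = {g: sum(v) / len(v) for g, v in groups.items()}

    # Model sum of squares: sum over plants of (fitted - overall mean)^2
    ss_model = sum(len(v) * (means[g] - ybar) ** 2 for g, v in groups.items())
    # Residual sum of squares: sum over plants of (observed - fitted)^2
    ss_resid = sum((yi - means[g]) ** 2 for g, v in groups.items() for yi in v)

    nu1, nu2 = 2, n - 3
    sigma2_hat = ss_resid / nu2
    f_stat = (ss_model / nu1) / sigma2_hat
    lod = nu1 * f_stat / (2.0 * math.log(10.0))
    return f_stat, lod


if __name__ == "__main__":
    y = [10, 11, 9, 5, 6, 4, 8, 8.5, 7.5]   # made-up trait values
    x = [1, 1, 1, -1, -1, -1, 0, 0, 0]      # made-up genotype codes
    print(marker_f_and_lod(y, x))
```

Repeating the call marker by marker gives the "classical" scan; the marker with the largest value is the candidate location.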

The  “classical  approach”  computes  the  F-statistic 
at  all  markers,  concluding  that  a  QTL  is  near  the 
marker  locus  with  the  most  significant  value.  Lan¬ 
der  and  Botstein  (1989)  expanded  the  normal  model 
to  examine  intervals  between  marker  loci.  Consider 
markers $m$ and $m'$ with recombinant frequency $r$ and
indicators $x$ and $x'$. A QTL with recombination
frequency $pr$ with $m$ has conditional expectation

$E(y \mid r, m, m') = \mu + a[(1 - p)x + p x'] + d f(x, x'; p, r),$

where $f$ is complicated but tractable (Knapp, Bridges
and Birkes, 1990). However, conditional on position
($p$ and $r$) the model is linear in parameters $a$ and $d$.

Lander’s  MAPMAKER/QTL  program  profiles  the 
likelihood  (cf.  Kalbfleisch  and  Sprott,  1970;  Bates 
and  Watts,  1988)  across  intervals  for  adjacent  mark¬ 
ers  on  the  linkage  map,  with  the  maximum  likelihood 
estimator  (MLE)  corresponding  to  the  highest  peak. 
At the MLE, if $a = d = 0$ then

$\max(\mathrm{LOD}) \approx [\nu_1 / 2\log(10)] F_{\nu_1, \nu_2} \approx \chi^2_{\nu_1} / 2\log(10),$

with the latter approximation used in practice, ignoring
the extra variation of the estimate $\hat{\sigma}^2$.

Confidence  Regions  for  QTL 

Confidence regions arise by inverting the probability
statement $\Pr\{\max(\mathrm{LOD}) < c_\alpha\} = 1 - \alpha$. The
99% theoretical confidence region for the major QTL
of the phenotypic trait “days to budding” lies on chromosome
3 (Figure 2a), primarily around 100 cM but
with small intervals around 60 and 80 cM. These intervals
have LOD scores at least $\max(\mathrm{LOD}) - 2$, with
$c_{.01} \approx \chi^2_{2;.01} / 2\log(10) \approx 2$. Some QTL had confidence
regions spanning intervals on several chromosomes.
Beware that such regions may be too narrow,
having much smaller coverage probability than
expected (Terry Speed, pers. comm.).
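
As a quick check of the cutoff of about 2 LOD units, the chi-square approximation with 2 degrees of freedom can be evaluated directly. The sketch below is illustrative (the function name `lod_threshold` is hypothetical) and relies on the fact that the upper-$\alpha$ quantile of a chi-square with 2 degrees of freedom is $-2\log\alpha$, so the LOD cutoff reduces to $-\log_{10}\alpha$.

```python
import math


def lod_threshold(alpha):
    """LOD drop c_alpha from the chi-square(2) approximation.

    For 2 degrees of freedom the upper-alpha chi-square quantile is
    -2*ln(alpha), so c_alpha = -2*ln(alpha) / (2*ln(10)) = -log10(alpha).
    """
    chi2_quantile = -2.0 * math.log(alpha)
    return chi2_quantile / (2.0 * math.log(10.0))


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        print(alpha, round(lod_threshold(alpha), 3))
```

Under this approximation a 95% region would correspond to a drop of about 1.3 LOD units; the caution about actual coverage applies equally to any such cutoff.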

The  LOD  score  should  be  roughly  quadratic  near 
the  true  locus.  In  practice,  the  profile  is  quite  ir¬ 
regular  (Figure  2a)  and  the  profile  traces  (Bates  and 
Watts,  1988;  Ritter,  Bisgaard  and  Bates,  1991)  for  a 
and  d  exhibit  strong  nonlinearity  and  some  numerical 
problems (Figure 3). This suggests caution in interpreting
the parameter estimates from current methods,
and a need for some refinement.

Major and Minor QTL

Finding  multiple  loci  which  control  a  quantitative 
trait  is  a  stepwise  process  in  which  one  identifies  the 
major  QTL,  removes  its  effect,  then  proceeds  to  the 
most important minor QTL, and so on. For the two-loci
additive model (ignoring interval mapping),

$E(y) = \mu + a_1 x_1 + d_1(1 - |x_1|) + a_2 x_2 + d_2(1 - |x_2|),$

where the major (1) and minor (2) loci may be on different
chromosomes. The LOD can be decomposed,
$\mathrm{LOD}(1,2) = \mathrm{LOD}(1) + \mathrm{LOD}(2|1)$, suggesting that
one fit the major locus model (as $\hat{y}_1$) and then conditionally
fit the minor locus,

$E(y - \hat{y}_1 \mid x_1) = a_2 x_2 + d_2(1 - |x_2|).$

If the two loci were on separate chromosomes, one
would expect estimates of $a_2$ and $d_2$ to be independent
of $x_1$ and $\mathrm{LOD}(2|1) = \mathrm{LOD}(2)$. However, the
profile  likelihoods  for  possible  minor  loci  for  days  to 
budding  on  chromosomes  6  and  7  changed  substan¬ 
tially  after  removing  a  major  QTL  on  chromosome 
3  (Figures  2  and  4).  Further,  the  MLE  for  the  first 
minor  QTL  is  at  one  end  of  chromosome  7,  not  in  the 
middle  of  chromosome  6  as  Figure  2  implies.  These 
discrepancies  may  be  due  to  epistasis  (interaction), 
cosegregation  of  chromosomes  during  meiosis,  or  to  a 
problem  with  modest  sample  size  imbalance. 

Discussion 

Maximum  likelihood  interval  mapping  of  QTL 
builds  naturally  on  classical  approaches.  Important 
computational  and  theoretical  issues  remain  in  link¬ 
age  map  construction  and  finding  QTL. 

Several  sources  of  variation  arise  in  building  linkage 
maps.  “Riflotyping”  of  polymorphisms  involves  a  vi¬ 
sual  assay  of  thousands  of  columns  on  blots,  although 
these  may  soon  be  scanned  by  computer.  Riflotype 
errors  of  RFLP  patterns  along  linkage  maps  may  af¬ 
fect  estimates  of  map  distance  and  marker  loci  order 
(Steve  Knapp,  Tom  Osborn,  pers.  comm.). 

The  interval  mapping  approach  to  QTL  assumes 
independence  between  marker  intervals  and  that  epis¬ 
tasis  (interaction)  between  loci  is  negligible  (Steve 
Knapp,  Terry  Speed,  pers.  comm.).  Variation  in  esti¬ 
mated  marker  location  on  the  linkage  map  may  affect 
QTL  peaks  and  parameter  estimates  (Figure  3).  Fur¬ 
ther,  the  empirical  distribution  of  LOD  scores  needs 
investigation  under  varied  conditions. 


As  markers  become  more  closely  spaced,  one  won¬ 
ders  how  information  from  neighboring  regions  could 
be  effectively  included  in  the  estimation  of  QTL,  par¬ 
ticularly  when  there  may  be  multiple  loci.  Present 
technology  allows  closely  spaced  markers  (1-2  cM), 
increasing  the  problems  of  riflotyping.  This  raises 
both  estimation  and  design  questions:  should  one 
gather  more  F2s  or  more  markers?  How  can  one  ac¬ 
count  for  riflotype  and  other  errors  in  the  estimation 
procedure?  Finally,  how  can  one  efficiently  use  infor¬ 
mation  in  the  local  neighborhood  of  QTL  to  smooth 
the likelihood surface by appropriate penalization?

[Figure panels: (b) Profile Trace for additive factor; (c) Profile Trace for dominance factor]

[Figure 2. Profile Likelihood for Days to Budding. Panel: (a) Chromosome 3]

[Figure 4. Conditional Profile Likelihood for Days to Budding. Panels: (a) Chromosome 3; (c) Chromosome 7]

I. Introduction

The recent and almost monolithic surge in interest
in molecular genetics and genetic analysis in general
has been complemented to a great degree by recent
advances  in  computer  science.  For  instance,  the  anal¬ 
ysis  of  protein  structure  and  function  —  a  vital  com¬ 
ponent  in  assessing  causal  pathways  between  low-level 
genomic  phenomena  and  phenotypic  expression  —  has 
been  greatly  aided  by  contemporary  visualization  and 
high-speed  data  processing  machinery.  On  another 
plane,  the  comparison  and  analysis  of  genome  sequence 
data  would  be  virtually  impossible  without  supercom¬ 
puters.  Despite  this  apparent  affinity  between  contem¬ 
porary  genetic  analysis  and  computer  science,  there 
exist  a  number  of  areas  in  genetic  research  which  have 
not  yet  tried  to  exploit  specialized  high  speed  comput¬ 
ing  machinery.  One  such  area  is  the  statistical  anal¬ 
ysis  of  linkage  and  segregation  phenomena  involving 
quantitative  traits.  This  is  odd  given  the  fact  that  a 
great  deal  of  theoretical  or  analytic  work  in  these  ar¬ 
eas  suggests  and  explicitly  recommends  the  use  of  high 
speed  or  novel-design  computers;  see,  for  example,  El¬ 
ston  and  Stewart  (1971),  Lange  and  Elston  (1975),  El¬ 
ston  (1981),  Boyle  and  Elston  (1979),  and  Cannings, 
Thompson,  and  Skolnick  (1978).  In  this  paper,  a  par¬ 
allel  strategy  for  computing  likelihoods  on  large  com¬ 
plex  pedigrees  with  quantitative  phenotype  data  is  dis¬ 
cussed  that  makes  use  of  a  basic  master/worker  inter¬ 
connect  paradigm. 

II.  Evaluating  Pedigree  Likelihoods 

Consider a locus with 2 alleles, A and a, that
produces 3 genotypes, AA, Aa, and aa, occurring in
the Hardy-Weinberg equilibrium dictated proportions
$f(AA) = p^2$, $f(Aa) = 2p(1 - p)$, and $f(aa) = (1 - p)^2$,
where $p$ is the frequency of the A allele. Associated
with each genotype is a mean effect $\mu_g$, $g \in
\{AA, Aa, aa\}$, and a common variance, $\sigma^2$. It should
be understood that trait values are taken to be normally
distributed around the relevant genotype mean.

Consider further a pedigree with $N$ members of
arbitrary complexity but without loops (i.e., inbreeding).
For those pedigree members whose parents are
not in the pedigree, the unconditional probability that
they have genotype $g$ is dictated by the frequency of
the genotype $f(g)$. For those pedigree members, $o$,
whose parents, $m$ and $f$, are in the pedigree, the $f(g)$
parameters are replaced by transmission probabilities,
$\tau(g_o \mid g_m, g_f)$, or the probabilities that an offspring, $o$,
has genotype $g_o$ given that his mother $m$ and father
$f$ have genotypes $g_m$ and $g_f$, respectively. Using this,
the likelihood of the parameters $p$, $\mu_{AA}$, $\mu_{Aa}$, $\mu_{aa}$, $\sigma^2$
given data $X = (x_1, \ldots, x_N)$ collected on a pedigree
with $N$ members can be written as:

$L(p, \mu_{AA}, \mu_{Aa}, \mu_{aa}, \sigma^2 \mid X) = \sum_{g_1} \sum_{g_2} \cdots \sum_{g_N} \delta(g_1)\phi(x_1 \mid \mu_{g_1}, \sigma^2)\, \delta(g_2)\phi(x_2 \mid \mu_{g_2}, \sigma^2) \cdots \delta(g_N)\phi(x_N \mid \mu_{g_N}, \sigma^2) \quad (1)$

where the sums over the $g_i$, $i = 1, \ldots, N$, are sums over
all possible genotypes member $i$ might have, $\delta(\cdot)$ is either
an $f(\cdot)$ function or a $\tau(o \mid m, f)$ function, depending
on whether or not the pedigree member’s parents
are pedigree members, and $\phi(x \mid \mu, \sigma^2)$ is the normal
density function. The compound sum in equation (1)
over the $g_i$ can be quite large and therefore prohibitive
computationally (e.g., for 3 genotypes, the sum for a
pedigree with $N$ members would involve $3^N$ terms).
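
To make the size of the compound sum concrete, the following sketch evaluates the pedigree likelihood by brute force, enumerating all $3^N$ genotype assignments. It is an illustration under stated assumptions rather than the algorithm discussed in this paper: the function names, the encoding of the pedigree as a list of parent indices, and the Mendelian transmission rule are choices made for this example.

```python
import math
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")


def hardy_weinberg(p):
    """Genotype frequencies f(g) for A-allele frequency p."""
    return {"AA": p * p, "Aa": 2 * p * (1 - p), "aa": (1 - p) ** 2}


def prob_pass_A(g):
    """Probability that a parent with genotype g transmits allele A."""
    return {"AA": 1.0, "Aa": 0.5, "aa": 0.0}[g]


def transmission(g_o, g_m, g_f):
    """Mendelian transmission probability tau(g_o | g_m, g_f)."""
    a, b = prob_pass_A(g_m), prob_pass_A(g_f)
    return {"AA": a * b, "Aa": a * (1 - b) + (1 - a) * b,
            "aa": (1 - a) * (1 - b)}[g_o]


def normal_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def pedigree_likelihood(parents, x, p, mu, sigma2):
    """Brute-force likelihood over all 3**N genotype assignments.

    parents[i] is None for a founder, else (mother_index, father_index).
    x[i] is the trait value of member i; mu maps genotype -> mean.
    """
    freq = hardy_weinberg(p)
    n = len(x)
    total = 0.0
    for g in product(GENOTYPES, repeat=n):      # 3**n terms
        term = 1.0
        for i in range(n):
            if parents[i] is None:
                delta = freq[g[i]]
            else:
                m, f = parents[i]
                delta = transmission(g[i], g[m], g[f])
            term *= delta * normal_pdf(x[i], mu[g[i]], sigma2)
        total += term
    return total


if __name__ == "__main__":
    # A trio: members 0 and 1 are founders, member 2 is their child.
    parents = [None, None, (0, 1)]
    mu = {"AA": 2.0, "Aa": 1.0, "aa": 0.0}
    print(pedigree_likelihood(parents, [1.8, 0.3, 1.1], 0.3, mu, 1.0))
```

Even for this trio the loop visits 27 assignments; a 20-member pedigree would need about 3.5 billion, which is why the iterated-sum formulation matters.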

III.  Parallel  Likelihood  Evaluation 

The pioneering papers of Elston & Stewart (1971),
Lange & Elston (1975), and Cannings, Thompson, &
Skolnick (1978), all showed how the compound sum
in equation (1) could be written as an iterated sum.
The basic idea is to take small groups of closely related
pedigree members (e.g., nuclear families or small pedigrees)
and compute likelihoods involving these members
conditionally on the genotypes of some “pivotal”
member of the group who is also the member of another
group. These conditional likelihoods are saved
and incorporated into the evaluations of likelihoods of
other groups of pedigree members. As an example,
consider the pedigree in figure 1a and the groupings
for that pedigree depicted in figure 1b. Note that likelihoods
involving group $n_1$ with members 1, 2, 4, 6,
8, 9, 11, 13 must be computed after likelihoods involving
groups $n_2, \ldots, n_7$ since the genotypes of members
4, 6, 8, 9, 11, 13 are needed in the computations
for groups $n_2, \ldots, n_7$. Consider calculations involving
$n_2$. The likelihood involving members 4, 15, 16 can
be computed conditionally on each possible genotype
(AA, Aa, aa) for member 4. These three conditioned
likelihoods are saved. This same process is done for
$n_3, \ldots, n_7$ by conditioning on members 6, 8, 9, 11, 13,
respectively. The resulting conditional likelihoods are
then used to “weight” the possible genotype arrangements
considered in the likelihood evaluation of $n_1$
members 4, 6, 8, 9, 11, 13 in conjunction with members
1 and 2.

For pedigrees which have a single line of descent
emanating from each spouse-pair (e.g., only one spouse
has his/her parents in the pedigree, as in figure 2), an
implementation of the Elston-Stewart algorithm on a
parallel computer can be described as follows. Starting
with the youngest (or latest) generation, $G$, of nuclear
families, compute the conditional likelihoods of
the members of these nuclear families conditioning on
those parents who are offspring in the $G - 1$ generation
of nuclear families. If there are $\ell_i$ nuclear families in
the $i$th generation, then these conditional likelihoods
can be computed on $\ell_i$ processors simultaneously. Once
conditional likelihoods for a generation’s nuclear families
have been computed, they are saved and sent out
for processing with the next oldest generation’s nuclear
families. This process continues until the final “root”
nuclear family’s likelihood is computed. The running
time of this strategy would have the simple form:

$T = s(n_1) + \sum_{i=2}^{G} \max[s(n_{i1}), \ldots, s(n_{i\ell_i})], \quad (2)$

where $s(n_{ij})$ is the size of nuclear family $n_{ij}$, $G$ is the
number of nuclear family generations, and $\ell_i$ is the
number of nuclear families at generation $i$. It can be
seen that the largest nuclear family at generation $i$
dominates the computation time spent computing the
conditional likelihoods associated with nuclear families
at that generation, and that a single processor time is
needed to compute the likelihood involving the final
“root” nuclear family ($n_1$). Note that since there can
be no genotype elimination based on phenotype information
for quantitative trait analysis, running time
scales with the size, not phenotype arrangement, of a
nuclear family.
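
The running-time argument above (each generation costs as much as its largest nuclear family, plus the root family) can be sketched as follows. This is a toy cost model, assuming the work for a nuclear family is proportional to its size and that there are enough processors for every family in a generation; the function names and the example sizes are hypothetical.

```python
def parallel_time(sizes_by_generation):
    """Idealized parallel cost: root family plus, for each later
    generation, the size of its largest nuclear family.

    sizes_by_generation[0] holds the single root family size;
    sizes_by_generation[i] lists family sizes in generation i + 1.
    """
    root = sizes_by_generation[0]
    if len(root) != 1:
        raise ValueError("expected exactly one root nuclear family")
    return root[0] + sum(max(gen) for gen in sizes_by_generation[1:])


def serial_time(sizes_by_generation):
    """Single-processor cost: every nuclear family is handled in turn."""
    return sum(sum(gen) for gen in sizes_by_generation)


if __name__ == "__main__":
    sizes = [[5], [4, 3], [6, 2, 3]]   # made-up family sizes
    print(parallel_time(sizes), serial_time(sizes))
```

The ratio of the two numbers gives a rough upper bound on the speed-up; communication costs between processors are ignored in this sketch.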

For  more  complex  pedigrees  a  different  strategy  is 
needed.  Consider  the  pedigree  as  a  connected  graph, 
where  each  nuclear  family  in  the  pedigree  is  a  node. 
The  edge  relationships  between  the  nodes  are  dictated 
by the relatedness of the members of the nuclear families
comprising the pedigree. Each node is assigned
a weight equal to the number of members in the nuclear
family it represents. Figure 3b depicts the graph-theoretic
representation of the pedigree displayed in
figure 3a. Figure 4 displays a more complicated pedigree’s
graph representation. To optimally compute
a likelihood involving a complex pedigree in parallel,
compute, for each node, the sum of the weights of all
those nodes in an edge relationship with nodes themselves
in an edge relationship that ultimately leads to
the node in question. Call the nodes entering into
each of these sums a “path”. Let $H_i$ be the “depth”
of (i.e., the number of nodes implicated in) each path,
$i$. The node with the smallest maximum path (i.e.,
sum of weights) should be taken as representing the
“root” nuclear family whose likelihood calculations are
computed last. The nuclear families furthest away in
node representation from the root for each path are
distributed to different processors for conditional likelihood
evaluation. Once the conditioning processes are
completed for a nuclear family, the likelihoods are sent
to a processor which will compute likelihoods of a nuclear
family whose node representation is in an edge
relationship with the node representation of the nuclear
family in question. Note that some conditional
likelihoods will be sent to a common processor (i.e.,
two nodes have an edge relationship with a common
node). In this way, the conditioning processes will converge
to the root node, and thus give the complete
likelihood of the pedigree. See Schork (1991) for more
details,  an  assessment  of  the  running  time,  and  some 
experimental  results. 

IV.  Discussion 

The  algorithm  to  compute  pedigree  likelihoods 
outlined  above  is  intuitive,  but  does  possess  some  mi¬ 
nor  problems.  First,  it  is  imperative  that  an  efficient 
way of determining the optimal order in which to compute
the nuclear families in the pedigree be used or
Amdahl’s law will render the computational savings
in computing the conditional likelihoods in parallel
useless. Second, not all pedigrees will work well with
the algorithm; for instance, a pedigree which is simply
a horizontal chain of nuclear families can have no
more  than  two  of  its  constituent  nuclear  families’  con¬ 
ditional  likelihoods  computed  simultaneously.  Third, 
it  may  be  the  case  that  over  the  course  of  the  compu¬ 
tations  through  the  various  “paths”  some  processors 
will  not  be  utilized,  resulting  in  an  inefficient  use  of 
the  computer.  On  the  other  hand,  the  algorithm  can 
be  improved  by  letting  a  number  of  processors  work 
on  parts  of  the  sums  needed  in  the  computation  of  a 
given  nuclear  family’s  likelihood  calculations.  In  this 
way,  processors  would  be  utilized  to  a  greater  degree 
and a faster turnaround time would also result.

Abstract

The problem of estimating parameters in a complex computer
simulator of a nuclear fusion reactor from an experimental
database is treated. Practical limitations do
not permit a standard statistical analysis using nonlinear
regression methodology. The assumption that the
function  giving  the  true  theoretical  predictions  is  a  real¬ 
ization  of  a  Gaussian  stochastic  process  provides  a  statis¬ 
tical  method  for  combining  information  from  relatively 
few  computer  runs  with  information  from  the  experimen¬ 
tal  database  and  making  inferences  on  the  parameters. 

1  Introduction  and  Problem 
Formulation 

Mathematical  models  of  natural  phenomena  are  often 
implemented  in  complex  computer  programs.  Sometimes, 
the  mathematical  model  is  completely  specified,  and  it  is 
only  necessary  to  execute  the  code  to  make  predictions 
about  the  natural  process  under  study.  However,  there 
may  be  unknown  parameters  in  the  mathematical  model. 
If  there  is  an  available  database  of  experimental  results, 
then  statistical  methods  may  be  employed  to  make  in¬ 
ferences  about  the  unknown  parameters  and  predictions 
of  the  natural  process.  In  this  article,  we  consider  an 
example  of  such  a  problem  in  nuclear  fusion  research. 
In  this  application,  the  computer  implementation  of  the 
mathematical  model  is  so  complex  that  it  is  not  prac¬ 
tically  possible  to  execute  the  program  as  many  times 
as  is  needed  to  perform  a  classical  statistical  analysis. 
We  propose  a  Bayesian  methodology  which  allows  us  to 
combine limited information on the mathematical model
(obtained from relatively few computer runs) with the
experimental database and still make the requisite inferences.
The methodology we propose may prove useful
in other applications in other disciplines where complex
computer codes must be tuned to real data sets.

A tokamak (a Russian acronym for “toroidal magnetic
chamber”) is a device for producing plasmas capable
of nuclear fusion (Wesson, 1987). There are currently
45 tokamaks worldwide and a large database of
individual “shots” (i.e., experimental runs) potentially
available. Currently, none of the tokamaks is capable of
producing more energy from fusion reactions than the
amount of energy needed for confinement and heating of
the plasma. It is desired that the next generation of tokamaks
should reach at least the break-even point. As such
devices are extremely expensive (approximately $\$10^{10}$),
it is critical that a good model is available for design
purposes.

There  are  several  variables  which  the  experimenters 
can  control  from  shot  to  shot,  of  which  the  main  ones 
are: $I$ = plasma current; $P$ = heating power; $n$ = particle
density; $B$ = toroidal magnetic field. There are
also some geometrical variables, but for simplicity we
will limit ourselves to two tokamaks, ASDEX (in Germany)
and PDX (in Princeton), which have fixed simple
geometries, so that the only independent variables are
given above. We will use the first four components of
the five dimensional vector $x$ to denote the logarithms
of the independent variables, and the fifth component
will be an indicator of the machine, say $x_5 = 0$ for PDX
and $x_5 = 1$ for ASDEX.

A primary factor in promoting fusion is energy confinement,
which is measured by the global energy confinement
time $\tau_E$. It is a measure of the rate at which
energy is escaping from the tokamak at steady state. We
will let $y$ denote the measured value of $\log \tau_E$. We have
32 observations from ASDEX and 42 from PDX, giving
a total of 74 observations $(x_i, y_i)$.

A  comprehensive  mathematical  model  for  tokamaks 
has  been  developed  and  implemented  in  a  computer  code 

and NSA Grant MDA 904-89-H-2011. Support for computing
was provided by Cray Research and National
Center for Supercomputing Applications at the Univer¬

known as Baldur (Singer et al., 1988). There are several
unknown parameters in the Baldur model, however,
which we generically denote by the vector $c$. Each Baldur
run requires inputting a value of $x$, the experimentally
observable independent variables and machine indicator,
and a value of $c$. We let the exact theoretical prediction
(which is the ideal output of Baldur) be denoted
$Y(x, c)$.

In  principle,  inferences  about  c  can  be  made  with 
nonlinear  regression  techniques.  Thus,  c  may  be  esti¬ 
mated  by  minimization  of  the  residual  sum  of  squares 

$\mathrm{RSS}(c) = \sum_{i=1}^{74} [y_i - Y(x_i, c)]^2$

where $y_i$ is an observed $\log \tau_E$ and $Y(x_i, c)$ is the corresponding
theoretical prediction. Since $c$ is four dimensional
for our case, about 100 evaluations of
RSS($c$) would be needed for a nonlinear optimizer to find
$\hat{c}$. Even if the nonlinear least squares estimate $\hat{c}$ and
RSS($\hat{c}$) are known, it requires a minimum of 10 evaluations
to estimate the Hessian for constructing confidence
regions.

There  are  two  important  features  of  the  Baldur  code 
which  make  this  classical  nonlinear  regression  approach 
infeasible. For one thing, each Baldur run takes about
4 minutes of CPU time on a Cray II supercomputer.
Thus, even a single evaluation of RSS($c$) requires about
5 hours of supercomputer time, and 10 to 100 evaluations
of RSS($c$) are simply not practical. Secondly, Baldur
does not output the exact value of the prediction
$Y(x, c)$, but has a sampling error because of a Monte
Carlo integration inside of one routine. Multiple runs
of Baldur with the same inputs $(x, c)$ but different seeds
would be required to obtain an accurate value of $Y(x, c)$.
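
The cost figures quoted above follow from simple arithmetic: one evaluation of the residual sum of squares needs one Baldur run per observation. A small sketch (the constant and function names are hypothetical; the 4 minutes per run and 74 observations are taken from the text):

```python
RUN_MINUTES = 4      # approximate CPU minutes per Baldur run (from the text)
N_OBSERVATIONS = 74  # experimental observations, one run each per RSS evaluation


def rss_cost_hours(n_evaluations):
    """Supercomputer hours needed for n_evaluations of RSS(c)."""
    return n_evaluations * N_OBSERVATIONS * RUN_MINUTES / 60.0


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "evaluations:", round(rss_cost_hours(n), 1), "hours")
```

This does not yet account for the repeated runs that would be needed to average out the Monte Carlo error, which would multiply the cost further.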

We are thus constrained to making a limited number
of Baldur runs in order to obtain noisy values of
$Y(x, c)$, and then to somehow combine this incomplete
and  inexact  computer  data  with  the  experimental  data 
in  order  to  make  inferences  on  the  parameter  vector  c. 

2  Statistical  Model 

We  propose  a  statistical  model  for  the  problem  described 
above. The true function $Y(x, c)$ is assumed to be a realization
of a stochastic process. Such models have been
successfully  used  for  design  and  analysis  of  computer 
experiments  (Sacks,  Schiller,  and  Welch,  1989,  abbre¬ 
viated  [SSW];  Sacks,  Welch,  Mitchell,  and  Wynn,  1989, 
abbreviated  [SWMW]).  Further  details  on  our  proposed 
methodology  are  given  in  Park  (1991,  abbreviated  [P]). 

We will treat the models for the two tokamaks ASDEX
and PDX entirely independently. Thus, for most
of this section, the variable $x$ does not include an indicator
for the machine and we will assume only one
machine. Having obtained a likelihood for one machine,
the combined likelihood for both machines is obtained
by multiplication of their individual likelihoods (addition
of log likelihoods). We are not convinced this is the
best approach, but it is easy and probably leads to valid
if not fully efficient results.

For  convenience,  we  will  let  s  and  t  denote  a  value  of 
the  vector  (x,c),  where  x  and  c  are  both  4-dimensional, 
so  s  and  t  are  8-dimensional.  We  will  use  d  to  denote 
the  general  dimensions  of  s  and  t. 

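It is assumed that $Y(s)$ is a Gaussian stochastic process with
constant mean

$E\,Y(s) = \beta,$

and covariance function

$\mathrm{Cov}[Y(s), Y(t)] = \sigma_Y^2 \exp\left[-\sum_{i=1}^{d} \theta_i (s_i - t_i)^2\right].$

Here, $\beta$, $\sigma_Y^2 > 0$, and $\theta_i > 0$, $1 \le i \le d$, are parameters.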

More  general  models  are  possible  (see  [SSW],  [SWMW], 
or  [P]),  but  some  data  analysis  and  model  fitting  has 
suggested  a  model  of  this  form  is  appropriate. 
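
For readers who want to experiment with this covariance structure, a minimal sketch follows. It only illustrates a covariance of the stated product-Gaussian form; the function names, argument names and numerical values are hypothetical and are not taken from the analysis reported here.

```python
import math


def gaussian_cov(s, t, sigma2_y, theta):
    """Covariance sigma2_y * exp(-sum_i theta_i * (s_i - t_i)**2)."""
    if not (len(s) == len(t) == len(theta)):
        raise ValueError("s, t and theta must have the same length")
    exponent = sum(th * (si - ti) ** 2 for th, si, ti in zip(theta, s, t))
    return sigma2_y * math.exp(-exponent)


def cov_matrix(points, sigma2_y, theta):
    """Covariance matrix of the process at a list of input points."""
    return [[gaussian_cov(s, t, sigma2_y, theta) for t in points] for s in points]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 2.0), (0.5, 0.5)]   # made-up 2-d inputs
    for row in cov_matrix(pts, sigma2_y=2.0, theta=(0.5, 0.25)):
        print([round(v, 4) for v in row])
```

Larger values of a component of the scale vector make the process vary faster in that coordinate, which is why these parameters act as smoothing parameters.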

The Baldur code is executed at a set of inputs $s_{iC}$,
$1 \le i \le n_C$, giving observations $y_{iC}$, $1 \le i \le n_C$, which
are modelled as

$y_{iC} = Y(s_{iC}) + \epsilon_{iC}$

where the random errors $\epsilon_{iC}$ are assumed to be i.i.d.
$N(0, \sigma_C^2)$, and independent of $Y$. Note that the subscript
“C” designates computer data.

The experimental data $(x_{iE}, y_{iE})$, $1 \le i \le n_E$, is
modelled as

$y_{iE} = Y(x_{iE}, c_0) + \epsilon_{iE}$

where the $\epsilon_{iE}$ are i.i.d. $N(0, \sigma_E^2)$, independent of all previously
mentioned random quantities. Also, $c_0$ denotes
the true unknown value of the fusion theory parameters
to be estimated.

It is convenient to reparameterize the variances of
the random errors in terms of variance ratios, viz.

$\gamma_C = \sigma_C^2 / \sigma_Y^2, \quad \gamma_E = \sigma_E^2 / \sigma_Y^2.$

Thus, in addition to the 4-dimensional theory parameter
$c_0$, we also need to estimate $\beta$, $\theta$ (8-dimensional), $\gamma_C$, $\gamma_E$,
and $\sigma_Y^2$. Further, we assume each of these parameters
(other than $c_0$) is different for the tokamaks PDX and
ASDEX. The Gaussian process assumptions allow us to
develop formulae for the multivariate normal likelihoods,
which are maximized by a numerical optimization program.

One perspective on the above is that we are fitting a
nonparametric regression function $Y(x, c)$ using an empirical
Bayesian methodology. Part of the data (the experimental
observations) are missing components of the
independent variable, and the missing components have
a common value $c_0$. The parameters $\theta$, $\gamma_C$, and $\gamma_E$
control the “smoothness” of the fit, analogously to the
smoothing parameters $m$ and $\lambda$ in the Bayesian interpretation
of smoothing splines (Eubank, 1988, pp. 233-248).

Among the several strategies we have tried for estimating
parameters, the one which has worked best in
simulations (reported in [P]) is the following: estimate
the parameters $\theta$ and $\gamma_C$ by maximum likelihood using
the computer data alone. Then combine the computer
data and experimental data to estimate the remaining
parameters: $\beta$, $\sigma_Y^2$, $\gamma_E$, and $c_0$. One reason this method
may work well is that it uncouples some of the smoothing
parameter estimation from the estimation of $c_0$. It
has the added advantage that it reduces computational
time.

There  was  strong  prior  knowledge  about  the  theory 
parameter  vector  c  which  was  codified  into  a  log-normal 
prior  distribution.  The  components  of  c  were  a  priori 
independent  with  the  following  normal  distributions 
logc,  ~  /V(0,(21og2)2), 
logcj  ~  A^(log3,(log2)-), 
logC3~iV(0,(log2)^), 
loge^  ~  A/(ilog2,  (ilog2)2). 

The corresponding quadratic terms were added to the log
likelihood to penalize values of $c$ for being far from the
prior mean. The maximization of this penalized log likelihood
amounts to finding the posterior mode.
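
As a minimal sketch of the penalty described above, the quadratic terms can be written as the log of the normal prior density on each log c_i, dropping constants. The function names and the `log_lik` callable are hypothetical, and the factor 1/2 in the prior for c_4 is read from a damaged line, so it should be treated as an assumption.

```python
import math

# Prior means and standard deviations of log c_1, ..., log c_4.
# The 1/2 factors for c_4 are an assumption (the source line is damaged).
PRIOR_MEAN = [0.0, math.log(3), 0.0, 0.5 * math.log(2)]
PRIOR_SD = [2 * math.log(2), math.log(2), math.log(2), 0.5 * math.log(2)]


def log_prior_penalty(c):
    """Sum over i of -(log c_i - mean_i)^2 / (2 sd_i^2), constants dropped."""
    return sum(
        -0.5 * ((math.log(ci) - m) / s) ** 2
        for ci, m, s in zip(c, PRIOR_MEAN, PRIOR_SD)
    )


def penalized_log_likelihood(log_lik, c):
    """log_lik is a hypothetical callable returning the data log likelihood at c."""
    return log_lik(c) + log_prior_penalty(c)
```

The penalty is zero at the prior mean and grows quadratically in log c, so maximizing the sum gives the posterior mode under these priors.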

One strategy which we found useful for parameter parsimony
was to constrain components of $\theta$ to be equal. Based on
initial estimates wherein all components of $\theta$ were varied
independently, we chose three blocks of the components of
$\theta$ which were constrained to have a common value, and then
computed maximum likelihood estimates under these constraints.

To assess the accuracy of our estimates of $c_e$, estimated
standard errors are computed using the diagonal entries of
the inverse of the Hessian of the posterior log likelihood
evaluated at the maximum. This Hessian was evaluated
numerically. While we are aware of no directly relevant
asymptotic (or finite sample) theory to justify this, the
simulations reported in [P] suggest that it does not work
badly, although we have far too few simulations to assess
coverage probabilities. In any event, it does provide a
reasonable indication of error.
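
The standard-error calculation described above can be sketched as follows. This is an illustration only: the central-difference scheme, the step size `h`, and the toy objective are assumptions, not the authors' implementation, and the Hessian is taken of the negative log posterior so that its inverse is positive definite at the mode.

```python
import numpy as np


def numerical_hessian(f, x, h=1e-4):
    """Central-difference Hessian of a scalar function f at the point x."""
    x = np.asarray(x, dtype=float)
    k = x.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k)
            ej = np.zeros(k)
            ei[i] = h
            ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H


def standard_errors(neg_log_post, x_hat):
    """Square roots of the diagonal of the inverse Hessian at the mode x_hat."""
    H = numerical_hessian(neg_log_post, x_hat)
    return np.sqrt(np.diag(np.linalg.inv(H)))


# Toy check: a Gaussian posterior with known standard deviations 0.5 and 2.
sd = np.array([0.5, 2.0])
se = standard_errors(lambda x: 0.5 * np.sum((x / sd) ** 2), np.zeros(2))
```

For a Gaussian posterior the result reproduces the true standard deviations; for the penalized likelihood in the text it is only the heuristic indication of error that the authors describe.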


3  Numerical  Results 

In  this  section  we  report  the  results  of  applying  the 
methodology  of  the  previous  section  to  the  Baldur/Toka- 
mak  problem.  More  details,  including  the  data  sets,  may 
be  found  in  [P]. 

Table 1 shows the smoothing parameter estimates obtained
from only the computer data. To recapitulate, for each
computer data set (e.g., $n_c = 34$ observations for PDX), we
can maximize the likelihood based on the vector of
observations $y_c = (y_{c,1}, \ldots, y_{c,n_c})$ to obtain
estimates of $\beta$, $\sigma_Y^2$, $\gamma_c$, and $\theta$.
However, only the estimates of $\gamma_c$ and $\theta$ are used in
the subsequent analysis. One will note that we constrained
$\theta_1 = \theta_8$, $\theta_2 = \theta_3 = \theta_7$, and
$\theta_4 = \theta_5 = \theta_6$. This decision was made to
parametrize parsimoniously after initially estimating 8
independent $\theta_i$'s for each simulated machine.

Tables 2 and 3 present the results of the subsequent
likelihood maximization when computer and experimental
data from both machines were pooled. Table 2 shows
parameter estimates which are individual to a given machine.
The values in Table 3 are the ones of most interest: the
estimates of the theory parameters. The estimate of $c_2$ has
a relatively large estimated standard error, indicating that
our knowledge of it at this time is rather uncertain.

Finally, in Figures 1 and 2 we show residual plots
(residuals vs. predicted values). Based on our simulation
experience with toy models reported in [P], these plots
suggest a relatively good fit. In particular, the predicted
values for the computer data (dots) have a wider (horizontal)
range than the experimental predicted values. Also, the
residuals for the computer data have a much smaller
(vertical) range than those of the experimental data. These
indicate we are fitting the computer data relatively well,
and also getting good coverage of the range of $y(x,c)$.

Concluding  Remarks 

There  are  a  number  of  issues  which  arose  in  this 
investigation  which  we  have  not  mentioned  for  lack  of 
space.  One  is  the  design  of  the  computer  experiment. 

One problem which arose during the collection of computer
data is that the Baldur code was modified to improve
convergence. There was some change in output values between
the new and old codes when the same inputs were tried, but
it was on the same order as our estimate at the time of the
Monte Carlo sampling error. However, subsequent analysis
suggested the Monte Carlo error was much smaller than we
thought, and that the two versions of the code produced
somewhat different answers. All of our results above are
based on the data from the new and improved program. This
does raise the question of the size of the approximation and
roundoff error in the code and the extent to which that
affects the parameter estimates.

This  is  a  complex  problem  and  there  is  much  oppor¬ 
tunity  for  future  research. 
Table 1. Smoothing Parameter Estimates from Computer Code

Symbol       Description                            PDX Value               ASDEX Value
$n_c$        Computer data set sample size          34                      31
$\theta_1$   Correlation coefficient for $c_1$      1.6                     .033
$\theta_2$   Correlation coefficient for $c_2$      .19                     .13
$\theta_3$   Correlation coefficient for $c_3$      .19                     .13
$\theta_4$   Correlation coefficient for $c_4$      1.1                     .35
$\theta_5$   Correlation coefficient for log I      1.1                     .35
$\theta_6$   Correlation coefficient for log B      1.1                     .35
$\theta_7$   Correlation coefficient for log $n_e$  .19                     .13
$\theta_8$   Correlation coefficient for log P      1.6                     .033
$\gamma_c$   Variance ratio                         4.9 x 10^-[illegible]   1.1 x 10^-3

Table 2. Parameter Estimates for Individual Tokamaks

Symbol         Description                                PDX Value   ASDEX Value
$n_e$          Experimental data set sample size          42          32
$\beta$        Mean value for Y                           -1.68       -1.59
$\sigma_Y^2$   Variance of Y                              .0045       .35
$\gamma_e$     Variance ratio $\sigma_e^2 / \sigma_Y^2$   .023        .0064

Table 3. Estimates for Theory Parameters

Symbol   Description                            PDX Value   ASDEX Value
$c_1$    Drift Waves Coefficient                1.65        .15
$c_2$    Rippling Coefficient                   2.08        .65
$c_3$    Resistive Ballooning Coefficient       1.14        .22
$c_4$    Critical Value for the Ion             1.16        .095
         Temperature Gradient Mode

[Figures 1 and 2: residual plots (residuals vs. predicted values).]
Abstract

There  is  widespread  use  of  computer  models  as  tools  in 
scientific  research.  As  surrogates  for  physical  or 
behavioral systems, such models can be subjected to
experimentation, the goal being to predict how the
corresponding real system would behave under certain
conditions.  For  long-running  (expensive)  model  codes, 
there  may  be  a  severe  limitation  on  the  number  of  experi¬ 
ments  that  can  reasonably  be  done.  This  motivates  the 
construction  of  a  fast-running  (cheap)  approximation  to 
the  original  code,  for  use  in  experiments  where  a  large 
number  of  runs  may  be  necessary.  Here  we  discuss  our 
approximation  of  a  simulation  model  for  the  compression 
molding  of  sheet  molding  compound,  applied  to  the 
manufacture  of  an  automobile  hood.  The  approximation 
was  constructed  using  Bayesian  interpolation  methods  for 
prediction  of  the  movement  of  the  flow  front.  The  predic¬ 
tions  were  based  on  data  generated  by  a  sequence  of  com¬ 
puter  experiments,  using  designs  chosen  according  to  a 
type  of  D-optimality  criterion. 

1  Introduction 

The purpose of this paper is to demonstrate the application
of  Bayesian  methods  for  design  and  analysis  of  computer 
experiments  to  the  construction  of  a  "cheap"  substitute  for 
an  "expensive"  computer  model.  As  our  example,  we 
shall  use  a  computer  simulation  model  for  a  compression 
mold-filling  process  that  is  used  in  the  manufacture  of 
automobile  hoods.  Our  primary  use  of  this  model  was  to 
generate  prediction  formulas  that  could  serve  as  fast  sub¬ 
stitutes  for  the  real  model  in  certain  well-defined  tasks. 
This  done,  we  did  not  follow  through  any  further,  so  this 
account  is  best  considered  as  a  realistic  example  rather 
than  a  complete  scientific  application. 

Research sponsored by the Applied Mathematical Sciences
Research Program, Office of Energy Research, U.S. Department of
Energy, under Contract DE-AC05-84OR21400 with Martin Marietta
Energy Systems, Inc.


Except  for  some  philosophical  differences,  our  underlying 
approach  is  essentially  the  same  as  that  discussed  by 
Sacks,  Welch,  Mitchell,  and  Wynn  (1989).  As  noted 
there,  versions  of  this  approach  have  been  used  for  a  long 
time  in  various  settings,  e.g.,  kriging  and  Bayesian  interpo¬ 
lation.  The  details  of  the  method  (e.g.,  choice  of  correla¬ 
tion  function,  design  criterion)  are  more  in  line  with  Cur- 
rin,  Mitchell,  Morris,  and  Ylvisaker  (1991). 

2  The  Computer  Model 

Sheet  molding  compound  (SMC)  is  composed  of  polymer 
resin,  chopped  fibers,  filler,  and  additives.  Prior  to  the 
molding process, a "charge", or piece of SMC, is cut from
a  sheet  and  placed  in  a  heated  mold.  The  process  is  begun 
by  closing  the  mold  slowly;  during  the  process  the  material 
flows  and  fills  the  mold  cavity.  After  filling,  a  constant 
force  is  maintained  on  the  mold,  as  the  curing  reaction 
proceeds;  then  the  part  is  removed  and  the  curing  is  com¬ 
pleted. 

Designers  of  the  manufacturing  process  are  concerned 
with  the  movement  of  the  flow  front;  it  is  desirable  that  the 
charge  fill  the  mold  evenly  and  rapidly,  without  the  pres¬ 
ence  of  "knit  lines"  formed  when  two  parts  of  the  flow 
front  meet.  To  help  determine  the  effect  of  the  design 
parameters  (e.g.,  the  initial  shape  and  placement  of  the 
charge) on the flow front movement, a computer simulation
model is used. This model is a version of the TIMS
(Thin Mold filling Simulation) model, which was
developed by Tim Osswald and Charles Tucker of the
Department of Mechanical Engineering at the University
of  Illinois.  The  version  we  used  came  to  us  through  the 
courtesy  of  Alonzo  Church,  Jr.  and  Daniel  Fleming  of 
GenCorp  Research,  who  were  of  great  help  to  us  in  learn¬ 
ing  to  use  it  and  in  evaluating  the  results.  The  theory  and 
numerical  implementation  are  described  in  Osswald  and 
Tucker  (1990).  The  inputs  to  the  code  include  the 
geometry  of  the  part,  the  material  properties  (e.g.,  viscos¬ 
ity),  the  closing  speed,  the  final  thickness  of  the  part,  and 
the shape and location of the charge. The output consists
of  all  the  information  needed  to  predict  the  position  of  the 
flow front as a function of time. The code uses a finite element
method to solve a system of differential equations
based  on  the  physics  of  the  process.  This  is  not  a  trivial 
computation  -  each  run  of  the  model  code  takes  4-5 
minutes  on  a  Cray  X-MP  computer.  For  specific,  well- 
defined  experiments,  it  is  worthwhile,  therefore,  to  seek  a 
fast  approximation  to  the  model;  this  is  the  purpose  of  the 
exercise  we  shall  describe  here.  Of  special  interest  to  us  is 
the  highly  multidimensional  nature  of  the  response  (flow 
front  movement).  Previous  applications  of  our  prediction 
method,  and  of  similar  methods  described  by  other 
authors,  have  been  concerned  with  prediction  of  a  single 
response  computed  from  the  output.  Although  we  shall  do 
nothing  more  than  apply  the  same  prediction  method 
separately to 2345 related responses, we shall see that
even  this  kind  of  naive  approach  can  be  useful. 

3  Predictors  and  Responses 

In  this  example,  we  are  concerned  only  with  the  effect  of 
the  initial  shape  and  location  of  the  charge.  The  input  that 
defines  this  is  a  list  of  "nodes"  (in  the  finite  element 
discretization  of  the  mold  surface)  that  are  filled  initially 
by  the  charge.  There  are  469  nodes  altogether,  and  the  ini¬ 
tial  charge  typically  fills  30  to  40  of  them.  (Although 
nodes are actually points, each is associated with a small
subvolume of the mold. When we refer to a node as being
"filled", we are really referring to this associated
subvolume.) In order to represent the list of initially filled nodes
by  a  few  predictor  variables,  we  require  the  initial  shape  of 
the  charge  to  be  rectangular.  The  predictor  variables  are 
then  defined  by  the  boundaries  of  the  rectangle.  This  is 
done  conveniently  using  the  node  map  as  constructed  for 
the  finite  element  method,  where  the  nodes  form  an 
approximately  uniform  grid  over  the  part  of  the  mold 
where  the  charge  might  be  placed.  The  north  and  south 
boundaries of the charge correspond to the predictor variables
$t_1$ and $t_2$, while the east and west boundaries
correspond to $t_3$ and $t_4$, respectively. (The scaling is such
that $0 \le t_2 < t_1 \le 1$ and $0 \le t_4 < t_3 \le 1$.) For other
geometries,  of  both  the  charge  and  the  region  of  the  mold 
into  which  the  charge  is  to  be  placed,  the  representation  of 
the  initial  shape  and  location  of  the  charge  by  a  few  pred¬ 
ictor  variables  might  be  considerably  more  difficult. 

The next part of the setup of the prediction problem is to
define, from the mass of output, a manageable set of
response variables that will permit prediction of the flow
front. The output gives values of the function $p_m$ for all
nodes $m = 1, \ldots, 469$ at each time step in the simulation,
where $p_m(\tau)$ denotes the proportion of node $m$ that is
filled at time $\tau$.

At each node $m$, we defined the five responses

$y_{m1}$: the last recorded time at which node $m$
is empty ($p_m(y_{m1}) = 0$),

$y_{m2}$: the time at which node $m$ becomes
25% full ($p_m(y_{m2}) = 0.25$),

$y_{m3}$: the time at which node $m$ becomes
50% full ($p_m(y_{m3}) = 0.50$),

$y_{m4}$: the time at which node $m$ becomes
75% full ($p_m(y_{m4}) = 0.75$),

$y_{m5}$: the first recorded time at which node $m$
is 100% full ($p_m(y_{m5}) = 1$).

Since these values are not given directly by the output,
which gives values of $p_m$ at various times, we approximated
them by linear interpolation of the output data. The
prediction problem was then taken to be: Approximate the
2345 functions $y_{mr} = y_{mr}(t_1, t_2, t_3, t_4)$, where
$m = 1, \ldots, 469$ and $r = 1, \ldots, 5$, over the region
defined by $0 \le t_2 < t_1 \le 1$, $0 \le t_4 < t_3 \le 1$. Two
further practical constraints on the region of interest were
added. The first restricted the placement of the charge to be
symmetric about the north-south center line, i.e.,
$t_3 + t_4 = 1.0$. The second required that the number of nodes
initially filled by the charge be between 30 and 40; this was
our way of implementing a requirement that the area of the
mold surface initially covered by the charge be fairly constant.
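
The definition of the five responses by linear interpolation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the sample record are hypothetical, the recorded proportions are taken to be nondecreasing, and the node is assumed to start empty and to end full.

```python
import numpy as np

LEVELS = (0.25, 0.50, 0.75)  # fill fractions that define y_m2, y_m3, y_m4


def five_responses(times, p):
    """times: increasing recorded times; p: nondecreasing fill proportions."""
    times = np.asarray(times, dtype=float)
    p = np.asarray(p, dtype=float)
    y1 = times[p <= 0.0].max()  # last recorded time at which the node is empty
    y5 = times[p >= 1.0].min()  # first recorded time at which the node is full
    middle = []
    for level in LEVELS:
        j = np.searchsorted(p, level)  # first index with p[j] >= level
        # Linear interpolation between the two records that bracket the level.
        frac = (level - p[j - 1]) / (p[j] - p[j - 1])
        middle.append(times[j - 1] + frac * (times[j] - times[j - 1]))
    return [y1, *middle, y5]


# Hypothetical output record for a single node.
y = five_responses([0, 1, 2, 3, 4, 5], [0.0, 0.0, 0.2, 0.6, 1.0, 1.0])
```

Applying such a function at each of the 469 nodes yields the 2345 response values for one run of the model.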

4  Design 

The central idea (which is not original with us) is to
represent uncertainty about each function $y_{mr}$ on the
k-dimensional region of interest $T$ by means of a stochastic
process (random field) $Y_{mr}$. For simplicity and
convenience, we use stationary Gaussian (normal) processes as
priors. These are fully described by a constant
$\mu_{mr} = E[Y_{mr}(t)]$, a constant $\sigma_{mr}^2 = V[Y_{mr}(t)]$,
and a correlation function $R_{mr}$, where
$R_{mr}(d) = \mathrm{Corr}[Y_{mr}(t+d), Y_{mr}(t)]$ and where
$t = (t_1, \ldots, t_k)$ and $t+d = (t_1+d_1, \ldots, t_k+d_k)$
are any two "sites" (points in $T$) separated by a difference
vector $d$. For simplicity, we also take the 2345 $Y_{mr}$
processes to be independent of one another, and
$R_{mr}(d) = R(d)$ for all $(m,r)$. (The choice of independence
is made at the cost of ignoring information about the
relationships among the $y_{mr}$'s at any site. We have not
found it feasible to implement such information here.)

For a design criterion, we use the "maximum entropy"
principle (Lindley 1956), which in this case leads to a kind
of D-optimality, namely, the maximization of $|C_{DD}|$,
where $C_{DD}$ is, for any one of the processes $Y_{mr}$, the
$n \times n$ matrix of prior correlations among the design sites
(Shewry and Wynn 1987). We find this criterion appealing,
for reasons given by Currin et al. (1991), but other criteria
could be used. (See, e.g., Sacks, Schiller and Welch
1989 and Sacks, Welch, Mitchell, and Wynn 1989.)

Of course, one cannot maximize $|C_{DD}|$ without specifying
how $C_{DD}$ depends on $D$. For our priors, this means
specifying the correlation function $R$. We favor using a weak
correlation function, i.e., one for which $R(d)$ decreases
rapidly to zero as $d$ increases. Such a strong conviction of
prior ignorance is not useful for analysis, since one would
need to observe $y$ at very many sites, located densely in $T$,
in order to yield predictions that are usefully precise. At
the design stage, however, we feel that the choice of a
weak correlation function is appropriately conservative.

For design purposes then, we use the exponential correlation:

$$R(d) = e^{-\theta \sum_j |d_j|} \qquad (4.1)$$

where $\theta$ is "large". Asymptotically (as $\theta \to \infty$),
it can be shown that the D-optimality criterion, where (4.1)
is used to construct $C_{DD}$, maximizes the minimum intersite
distance among design points, and favors those designs with
the fewest pairs whose intersite distance matches this
minimum. This is a special case of a result due to Johnson,
Moore, and Ylvisaker (1990), who called such designs
"maximin distance" designs. In this sense, the designs we
construct will attempt to push the design points as far away
from each other as possible.

For design construction, we use an algorithm similar to
DETMAX (Mitchell 1974). Starting with a random set of
n sites, the algorithm does a series of "excursions" in
which candidate sites are added to and removed from the
design. When adding a site, the chosen site is intended to
be the one at which the posterior variance, based on the
current design, is largest. It may not be possible to ensure
this if there are many sites to consider; if this is the case,
the algorithm does a limited search. When removing a site,
the chosen site is the one corresponding to the largest
diagonal element in the inverse of the current $C_{DD}$ matrix.
See Currin et al. (1991) for further details.

Here the set of candidate runs was formed by first letting
$t_1$ and $t_2$ take any of 11 levels and $t_3$ and $t_4$ take
any of 13 levels, subject to the restrictions on the region
of interest noted above.

The initial 10-run design, plus an additional 5 runs that
were chosen later, are shown in Table 1.


Initial Design

Run    $t_1$   $t_2$   $t_3$   $t_4$
1      0.40    0.00    0.75    0.25
2      0.40    0.20    1.00    0.00
3      0.80    0.60    1.00    0.00
4      1.00    0.00    0.58    0.42
5      0.80    0.40    0.75    0.25
6      0.60    0.40    0.92    0.08
7      0.50    0.20    0.83    0.17
8      0.70    0.10    0.67    0.33
9      0.90    0.60    0.83    0.17
10     1.00    0.50    0.67    0.30

Additional Points

Run    $t_1$   $t_2$   $t_3$   $t_4$
11     0.50    0.00    0.67    0.33
12     0.70    0.40    0.83    0.17
13     1.00    0.60    0.75    0.25
14     0.60    0.20    0.75    0.25
15     0.90    0.20    0.67    0.33

Table 1. Design for experiment on compression molding
model.
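
The design criterion of this section can be illustrated numerically. The sketch below builds the correlation matrix from the exponential correlation with an L1 intersite distance and compares log |C_DD| for two small designs. The two designs and the value of theta are hypothetical, chosen only to show that, for large theta, the design with the larger minimum intersite distance scores higher.

```python
import numpy as np


def l1_distances(D):
    """Matrix of L1 distances between the rows (sites) of D."""
    D = np.asarray(D, dtype=float)
    return np.abs(D[:, None, :] - D[None, :, :]).sum(axis=2)


def log_det_criterion(D, theta):
    """log |C_DD| with entries exp(-theta * L1 distance); larger is better."""
    _, logdet = np.linalg.slogdet(np.exp(-theta * l1_distances(D)))
    return logdet


def min_intersite_distance(D):
    """Smallest L1 distance between two distinct design sites."""
    dist = l1_distances(D)
    return dist[np.triu_indices(len(dist), k=1)].min()


# Two hypothetical 3-point designs on the unit square.
spread = [[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]]
clustered = [[0.0, 0.0], [0.1, 0.0], [0.5, 1.0]]
theta = 10.0  # a "large" value, for illustration only
```

A search algorithm of the DETMAX type would evaluate a criterion like `log_det_criterion` repeatedly while adding and removing candidate sites; that search is not reproduced here.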

The need for the additional runs was clear after inspection
of the cross-validation predictions based on the initial
experiment. These runs were chosen using the same algorithm
and the same correlation function which generated the first
ten runs. The full 15-run design populates the region of
interest (which is relatively small here) quite densely; the
maximum distance $\sum_{j=1}^{4} |t_j - s_j|$ between any
feasible site $t$ not in the design and the closest design
site $s$ is 0.2.
5 Prediction

Predictions were made using standard formulas for
conditional normal distributions. Let $y_{mr,D}$ be the vector
of the n observed values of $y_{mr}$. The mean of $Y_{mr}(t)$
given $Y_{mr,D} = y_{mr,D}$ is

$$\hat{y}_{mr}(t) = \mu_{mr} + C_{tD} C_{DD}^{-1} (y_{mr,D} - \mu_{mr} J_n) \qquad (5.1)$$

where $C_{tD}$ is a row vector that holds the n prior
correlations between $Y_{mr}(t)$ and $Y_{mr,D}$, and $J_n$ is the
column vector composed of n 1's. In order to use (5.1), one
needs to specify the prior mean $\mu_{mr}$ and the correlation
function (needed for $C_{tD}$ and $C_{DD}$). In our approach, we
arbitrarily chose a family of correlation functions, indexed
by a set of parameters $\theta$, and then used cross-validation
to select $\mu_{mr}$ and $\theta$.
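
The predictor (5.1) can be sketched in a few lines. For brevity this illustration uses the exponential correlation of Section 4 rather than the piecewise cubic family chosen for the example, and the design, data, prior mean, and theta are hypothetical values.

```python
import numpy as np


def exp_corr(a, b, theta):
    """Exponential correlation between the rows of a and the rows of b."""
    l1 = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=2)
    return np.exp(-theta * l1)


def predict(t, D, y, mu, theta):
    """Posterior mean (5.1) at the sites t, given data y observed at design D."""
    t, D, y = (np.asarray(v, dtype=float) for v in (t, D, y))
    C_DD = exp_corr(D, D, theta)
    C_tD = exp_corr(t, D, theta)
    return mu + C_tD @ np.linalg.solve(C_DD, y - mu)


D = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
y = np.array([2.0, 3.0, 5.0])
at_design = predict(D, D, y, mu=3.0, theta=2.0)
```

Two properties are visible in this form: the predictor interpolates the observed data at the design sites, and it falls back to the prior mean at sites that are essentially uncorrelated with every design site.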

For the present example, we chose the product piecewise
cubic correlation (Currin et al. 1991):

$$R(d_1, \ldots, d_k) = \prod_{j=1}^{k} R_j(d_j), \qquad (5.2)$$

where k is the number of predictor variables, and

$$R_j(d_j) = 1 - 6 (d_j / \theta_j)^2 + 6 (|d_j| / \theta_j)^3, \quad |d_j| \in I_1, \qquad (5.3a)$$
$$R_j(d_j) = 2 (1 - |d_j| / \theta_j)^3, \quad |d_j| \in I_2, \qquad (5.3b)$$
$$R_j(d_j) = 0, \quad |d_j| \in I_3, \qquad (5.3c)$$

where $I_1 = [0, \theta_j / 2]$, $I_2 = [\theta_j / 2, \theta_j]$, and
$I_3 = [\theta_j, \infty)$.

There is no particularly compelling reason to use this
instead of some other family of correlation functions.
However, the piecewise cubic does have two appealing
features: (i) $R_j(d_j)$ decreases to 0 as $|d_j|$ increases to
$\theta_j$, so that predictions can be made more local or less
local by controlling $\theta_j$, and (ii) $\hat{y}$ is a cubic
spline in every $t_j$ if the other $t_j$'s are fixed. (This is
because each element of $C_{tD}$, regarded as a function of
$t_j$, is itself a cubic spline.) Cubic splines are quite
highly regarded as interpolators and data smoothers; Bayesian
prediction based on (5.2)-(5.3) produces an interpolating
cubic spline with very little effort on the part of the user.

To select the parameters by "leave-one-out" cross-validation,
each of the n experimental runs is deleted in turn, and the
data at the remaining sites are used to predict y at the
deleted site. Computationally, this is not as exhausting as
it seems, since it can be shown that the error of prediction
for response m,r at the deleted site i is

$$e_{mr,i} = q_i (g_{mr,i} - \mu_{mr} w_i),$$

where

$$g_{mr} = C_{DD}^{-1} y_{mr,D}, \qquad w = C_{DD}^{-1} J_n,$$

and $q_i$ is the inverse of the ith diagonal element of
$C_{DD}^{-1}$. Here $C_{DD}$ is based on the full n-run design.
The cross-validation root mean squared error is then:

$$\mathrm{CVRMSE} = \left[ \frac{1}{2345 n} \sum_{i=1}^{n} \sum_{m=1}^{469} \sum_{r=1}^{5} e_{mr,i}^2 \right]^{1/2} \qquad (5.4)$$

Given the $\theta_j$'s, this is easy to minimize over the
$\mu_{mr}$'s, but minimization over the $\theta_j$'s requires
iterative search; this is by far the most (computer)
time-consuming part of the prediction method.
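
The leave-one-out shortcut described above can be checked against brute-force deletion. The sketch below assumes the standard form of the piecewise cubic correlation (value 1 at zero distance, 0 beyond theta) and uses randomly generated sites and responses, all hypothetical; it covers a single response rather than all 2345.

```python
import numpy as np


def cubic_corr_1d(d, theta):
    """Piecewise cubic correlation in one coordinate (assumed standard form)."""
    u = np.abs(d) / theta
    return np.where(u <= 0.5, 1 - 6 * u**2 + 6 * u**3,
                    np.where(u < 1.0, 2 * (1 - u)**3, 0.0))


def corr_matrix(A, B, thetas):
    """Product correlation between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.prod(cubic_corr_1d(diff, np.asarray(thetas)), axis=2)


def loo_errors(C_DD, y, mu):
    """Leave-one-out errors q_i (g_i - mu w_i), computed without refitting."""
    C_inv = np.linalg.inv(C_DD)
    g = C_inv @ y
    w = C_inv @ np.ones(len(y))
    q = 1.0 / np.diag(C_inv)
    return q * (g - mu * w)


def loo_errors_brute(D, y, mu, thetas):
    """The same errors obtained by actually deleting each run in turn."""
    errs = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        C = corr_matrix(D[keep], D[keep], thetas)
        c = corr_matrix(D[i:i + 1], D[keep], thetas)
        pred = mu + c @ np.linalg.solve(C, y[keep] - mu)
        errs.append(y[i] - pred[0])
    return np.array(errs)


rng = np.random.default_rng(0)  # hypothetical data, used only for the check
D = rng.random((6, 2))
y = rng.normal(size=6)
thetas = [1.0, 1.0]
fast = loo_errors(corr_matrix(D, D, thetas), y, mu=0.2)
slow = loo_errors_brute(D, y, 0.2, thetas)
cvrmse = np.sqrt(np.mean(fast**2))
```

Because `g` and `w` do not depend on the prior mean, changing `mu` requires no new matrix inversion, which is why minimizing over the means is cheap once the correlation parameters are fixed.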


To save time in the search for the optimal correlation
parameters ($\theta_j$'s), we used only one response at each
node, namely $y_{m3}$, the time to 50% filling. This seemed
reasonable since we expected the other response functions to
be similar in form. The values of $\mu_{m3}$, $m = 1, \ldots, 469$,
and $\theta_j$, $j = 1, \ldots, 4$, were chosen to minimize (5.4)
with $r = 3$ and a divisor of $469n$. Then, fixing the
$\theta_j$'s at these values, we determined values of $\mu_{mr}$
for all m and r (again by cross-validation), this time using
all 5 responses at each node.

In our first analysis, the cross-validation results at
particular nodes indicated that the predictions of $y_{mr}$
tended to be lower than the true values when the area of the
charge was smaller than average and higher than the true
values otherwise. That is, the predictions had the flow front
moving too fast when the area of the charge was relatively
small. We assumed that this was due to the increase in the
height of the charge when the area is small (since the volume
is held constant), which would presumably result in a slowing
of the movement of the front as computed by TIMS. At any
rate, we decided to introduce an additional predictor:
$t_5 = (t_1 - t_2)(t_3 - t_4)$, which represents the approximate
area of the charge, and we repeated the analysis. This
reduced the cross-validation errors, so the area was used as
a predictor in all subsequent predictions.
We then implemented the prediction equations for all
responses in the form of a short computer code "FTIMS",
which serves as a fast emulator of TIMS for investigating
the effects of changing the shape and location of the
charge. The input and output files for FTIMS are of
exactly the same form as those for TIMS. The only difference
is that the output for FTIMS is based on the prediction
equations that followed from the computer experiment
we described here, rather than the finite element solution
to the differential equations of the model.

FTIMS converts the TIMS input into the site $(t_1, \ldots, t_5)$
at which predictions are desired. The 15x1 vector $C_{tD}$ of
correlations between this site and the design sites is
computed using the values of $\theta_j$, $j = 1, \ldots, 5$, that
we found to be optimal by the cross-validation criterion.

The predictions of the responses $y_{mr}$, $m = 1, \ldots, 469$,
$r = 1, \ldots, 5$, are made using (5.1), where the 15x1 vector
$w = C_{DD}^{-1} J_n$ (which is the same for all m, r) is
provided by a fixed input file, as are the 15x1 vector
$g_{mr} = C_{DD}^{-1} y_{mr,D}$ and the scalar $\mu_{mr}$. FTIMS
then adjusts the five predicted responses at each node, if
necessary, to incorporate the knowledge that the true
responses are nonnegative and nondecreasing. (We do not
expect this adjustment to be needed very often, since the
predictions interpolate data that satisfy these requirements.
In the test case that we report below, the adjustment was
needed at only two of the 469 nodes.) Monotonicity is
enforced in a straightforward way, based on the notion that,
of the five responses at node m, $y_{m3}$ (i.e., the time to
50% filling) is generally the most reliable. This response is
therefore left unchanged, and $y_{m2}$ and $y_{m4}$ are
adjusted, if necessary, so that $y_{m2} \le y_{m3} \le y_{m4}$.
Keeping these three predicted responses constant, $y_{m1}$ and
$y_{m5}$ are adjusted similarly.
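
One way to carry out the adjustment just described is sketched below. The text fixes the order in which the responses are treated but not how far each value is moved, so the rule used here (move each offending value just far enough) is an assumption, as is the function name.

```python
def enforce_order(y):
    """Return [y1, ..., y5] adjusted so that 0 <= y1 <= y2 <= y3 <= y4 <= y5.

    The 50% time y3 is kept fixed and is assumed to be nonnegative.
    """
    y1, y2, y3, y4, y5 = y
    y2 = min(y2, y3)  # pull y2 down to y3 if it overshoots
    y4 = max(y4, y3)  # push y4 up to y3 if it undershoots
    y1 = min(y1, y2)  # then treat the outer pair the same way
    y5 = max(y5, y4)
    y1, y2 = max(y1, 0.0), max(y2, 0.0)  # times cannot be negative
    return [y1, y2, y3, y4, y5]
```

Predictions that already satisfy the ordering pass through unchanged, which matches the expectation that the adjustment is rarely needed.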

To convert the five predicted responses at each node into
estimates of $p_m(\tau)$ at the values of time desired, FTIMS
again uses linear interpolation. The results are then
printed in exactly the same form as the output produced by
TIMS. The postprocessor that normally runs on TIMS
output can then be applied to the output of FTIMS. This
produces plots of the position of the flow front at various
times. In a test case in which $t_1 = 0.7$, $t_2 = 0.3$,
$t_3 = 0.75$, and $t_4 = 0.25$, examination of these plots showed
the predicted front to be just a little ahead of the true
front. On average, the predicted time to 50% filling in this
case was 0.14 seconds less than the time calculated by TIMS;
the root mean squared error for $y_{m3}$ over all nodes was
0.23 seconds. In seven other randomly chosen test cases,
the root mean squared error for $y_{m3}$ over all nodes varied
from 0.01 sec to 0.68 sec, with a median of 0.27 sec. In
these test cases, the "true" times to 50% filling, averaged
over all nodes, varied from 6.4 to 9.1 seconds.
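
The conversion from the five predicted times back to a fill proportion can be sketched with a single interpolation call. The function name is hypothetical, and the five times are assumed to be strictly increasing.

```python
import numpy as np

FILL = [0.0, 0.25, 0.50, 0.75, 1.0]  # fill proportions reached at y_m1, ..., y_m5


def fill_proportion(tau, y):
    """Estimate p_m(tau) by linear interpolation through the points (y_mr, FILL[r])."""
    # np.interp holds the end values, giving 0 before y_m1 and 1 after y_m5.
    return np.interp(tau, y, FILL)
```

For example, with predicted times 1, 2, 3, 4, 5 the estimated fill proportion at time 2.5 is 0.375.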

The  range  of  applications  of  the  current  version  of  FTIMS 
is  obviously  quite  limited.  Further  generalizations, 
modifications,  and  tests  would  need  to  be  made  before  it 
could be considered a practical tool for optimizing this
particular  sheet  molding  process.  Even  at  that  stage,  we 
would  regard  FTIMS  as  only  an  occasional  replacement 
for  TIMS,  when  one  wants  to  consider  many  scenarios 
quickly  and  one  is  willing  to  accept  an  approximate  result. 
The  computing  time  for  the  run  of  FTIMS  in  the  first  test 
case  described  above  was  about  43  seconds  on  a  Sun  3/50 
Workstation,  only  5  seconds  of  which  were  used  to  com¬ 
pute  the  predicted  response  vector  at  each  node.  The  rest 
of  the  time  was  used  for  input  and  output.  We  have 
already  noted  that  each  run  of  TIMS  takes  4-5  minutes  on 
a  Cray  X-MP,  so  the  availability  of  a  practical  and  well- 
tested  version  of  FTIMS  would  permit  more  extensive 
exploration  of  the  effects  of  shape  and  position  of  the 
charge on the movement of the flow front.
Abstract

In  the  course  of  designing  a  new  drug,  thousands  of 
candidate  structures  could  be  made  and  examined  by 
empirical  testing.  Medicinal  chemists  would  prefer  some 
way  of  selecting  a  diverse  subset  from  a  list  of  candidates. 
Our  statistical  approach  is  to  use  experimental  design 
technology  for  the  selection  process  and  to  use  computer 
visualization  techniques  for  examination  of  the  resulting 
design.  A  small  peptide  case  is  used  as  an  example.  The 
emphasis  of  this  paper  is  on  the  value  of  visualization 
techniques  in  understanding  the  design  and  in  explicating  the 
design  to  Medicinal  Chemists. 

Introduction 

There  are  countless  numbers  of  molecules  that  could  be 
made  for  testing  as  potential  drugs.  Ten  million  different 
molecules  have  been  made  and  registered;  for  most  of  these 
molecules  that  have  the  characteristics  of  typical  drugs,  there 
are  millions  of  possible  modifications.  Since  it  is 
impossible  to  make  all  these  molecules,  there  is  a  need  to 
create diverse sets of molecules that span the range of
possible  structures.  Hopefully,  the  "gaps"  between  the 
compounds  in  the  design  set  will  be  small  enough  that 
important  compounds  are  not  missed.  Our  idea  is  to 
describe  molecules  numerically,  use  statistical  experimental 
design  software  to  create  a  design  set,  and  examine  the 
resulting  design  using  3D  rotating  scattergraph  techniques. 
The  process  is  illustrated  using  tripeptides. 

What  is  a  Tripeptide? 

A  tripeptide  is  a  linear,  directed  sequence  of  three 
amino  acids.  There  are  three  variable  regions,  called  side 
groups,  joined  in  sequence  by  amide  linkages. 


There is a beginning amino group, -NH2, and a terminal
carboxyl group, -COOH. There are three variable regions
denoted by $R_1$, $R_2$, and $R_3$ and there is a direction to the
molecule. The following diagram captures these features.


There are 20 naturally occurring amino acids, so there are
20x20x20=8,000 possible tripeptides. The cost of making
enough compound for testing is about $500 per tripeptide, so
it would cost about four million dollars to make all possible
tripeptides. Because this cost is too high and the process
would take a long time to complete, it was decided to make
a small, diverse set of tripeptides in the hope that a more
cost effective discovery process would result.

Numerically  Characterize  a  Tripeptide 

Each  of  the  variable  regions  of  a  peptide  can  be 
described  using  three  numbers.  The  size  can  be  measured  as 
volume  or  surface  area.  Electronic  properties  can  be 
measured.  Also  the  lipophilicity  of  the  side  group  can  be 
measured. Lipophilicity is the propensity to dissolve in an
oil environment rather than a water environment. The blood is a water
environment, as is the interior of a cell. Between the two is
an oily cell membrane. Drugs typically have to pass from
blood to the interior of cells so the water/oil relative
solubility is important.

To  numerically  describe  a  tripeptide  we  combined  these 
three  numerical  measures  of  side  group  properties  across  the 
three  positions  using  linear  scales. 

Note the three positions from left to right. For each of the
three numerical descriptors, size, electronics, and
lipophilicity, we created three scores, mean, linear, and
quadratic. These scores have physical interpretations. For
example, if one adds up the size of each side group at each of
the three positions, then the score reflects the total size of
the tripeptide. As the tripeptide is directed, the linear
component measures a gradient along the tripeptide.
Because the R2 group is typically on the opposite side of
the tripeptide from the R1 and R3 groups, the size quadratic
score measures the width of the tripeptide.
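
The mean, linear, and quadratic scores can be sketched as orthogonal polynomial contrasts over the three positions. The coefficients used here, (1, 1, 1), (-1, 0, 1), and (1, -2, 1), are an assumption: they are a common unnormalized choice, and the scaling actually used in this work is not stated. The side-group sizes are made-up illustrative numbers.

```python
# Orthogonal polynomial contrasts over positions R1, R2, R3.
# Assumed, unnormalized coefficients; the source does not give its scaling.
CONTRASTS = {
    "mean": (1, 1, 1),        # total of the property over the tripeptide
    "linear": (-1, 0, 1),     # gradient from R1 to R3
    "quadratic": (1, -2, 1),  # ends (R1, R3) versus middle (R2)
}

def position_scores(values):
    """Return the three scores for one property measured at R1, R2, R3."""
    if len(values) != 3:
        raise ValueError("expected one value per position (R1, R2, R3)")
    return {
        name: sum(c * v for c, v in zip(coeffs, values))
        for name, coeffs in CONTRASTS.items()
    }

# Hypothetical side-group sizes in arbitrary units, increasing along the chain.
increasing = position_scores((1.0, 2.0, 3.0))
print(increasing)
```

For a steadily increasing size profile the linear score is positive and the quadratic score is zero, which matches the reading of the linear score as a gradient along the directed molecule.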

There are three measures of properties of side groups
and there are three scores determined for each so there are
nine numerical measures of tripeptide properties. In addition
to these scores, we computed various interactions among the
nine scores to give a total of 34 descriptive variables, i.e. each
of the 8,000 tripeptides was characterized with a vector of 34
numerical descriptors. The problem was to select about 100
tripeptides from the 8,000 so that the resulting set was as
diverse as possible.

Experimental  Design 

There are about 10^232 ways to select 100 objects from
8,000. We chose to use statistical experimental design
software  to  make  this  selection.  Our  problem  was  much 
bigger  than  problems  typically  attempted  using  statistical 
experimental  design  software,  so  we  had  to  improvise  using 
various  commercially  available  and  internally  developed 
software. 
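
The size of the selection problem can be checked with exact integer arithmetic. This is only a sanity check on the count of 100-element subsets of 8,000 candidates; it says nothing about how the design software searches that space.

```python
import math

# Exact number of ways to choose 100 objects from 8,000 (Python 3.8+).
n_ways = math.comb(8000, 100)

# Order of magnitude of the count.
exponent = math.floor(math.log10(n_ways))
print(f"about 10^{exponent} candidate subsets")
```

Exhaustive enumeration is clearly out of reach at this scale, which is why a search heuristic is needed.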

Experimental  Design  Software 

1. EChip (PC)

2. ACED (VAX or IBM)

3. OPTEX (IBM 3090)

4. In-house Fortran (IBM 3090)

Because EChip on the PC would handle only relatively
small problems, various iterative strategies were used. For
example, one can select a trial design from a small random
set of points, say 100 out of 800, do this several times, then
make a final selection from the "winners" of each of the trial
designs. Solutions on the PC took days to compute.
ACED code was obtained from Dr. W. Welch of the
University of Waterloo and modified to handle our large
problems. We increased memory allocations and in certain
instances  compiled  for  a  vector  processor.  We  were  able  to 
obtain  solutions  in  hours  on  our  mainframes.  Vector 
processing  greatly  speeded  up  the  selection  process.  After 
much  effort  we  were  able  to  obtain  a  good  82  point  design. 
This  design  had  55  percent  G-optimality.  Several  designs 
consisting  of  82  randomly  selected  points  were  checked. 
These  random  designs  typically  had  G-optimality  of  1  to  2 
percent. 
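
The staged strategy and the G-optimality check can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of EChip, ACED, or OPTEX: the selection step is a simple greedy forward search on det(X'X), the candidate matrix is random stand-in data, and G-efficiency is taken as p / (n * max prediction variance over the candidates), one common textbook definition (packages differ in scaling). The function names are hypothetical.

```python
import numpy as np

def greedy_design(candidates, n_points, ridge=1e-6):
    """Forward selection: repeatedly add the candidate that most increases det(X'X)."""
    p = candidates.shape[1]
    info = ridge * np.eye(p)  # small ridge keeps the matrix invertible early on
    chosen = []
    for _ in range(n_points):
        inv = np.linalg.inv(info)
        # det(M + x x') = det(M) * (1 + x' M^-1 x), so maximize x' M^-1 x.
        gain = np.einsum("ij,jk,ik->i", candidates, inv, candidates)
        if chosen:
            gain[chosen] = -np.inf  # never pick the same candidate twice
        best = int(np.argmax(gain))
        chosen.append(best)
        info += np.outer(candidates[best], candidates[best])
    return chosen

def staged_design(candidates, n_points, n_trials=5, subset_size=200, seed=0):
    """Pick trial designs from random subsets, then select again among the winners."""
    rng = np.random.default_rng(seed)
    winners = set()
    for _ in range(n_trials):
        subset = rng.choice(len(candidates), size=subset_size, replace=False)
        picks = greedy_design(candidates[subset], n_points)
        winners.update(int(subset[i]) for i in picks)
    pool = np.array(sorted(winners))
    final = greedy_design(candidates[pool], n_points)
    return pool[final]

def g_efficiency(design, candidates):
    """p / (n * max_x x'(X'X)^-1 x), with the maximum taken over the candidate set."""
    n, p = design.shape
    inv = np.linalg.inv(design.T @ design)
    variances = np.einsum("ij,jk,ik->i", candidates, inv, candidates)
    return p / (n * variances.max())

rng = np.random.default_rng(1)
candidates = rng.normal(size=(800, 6))  # stand-in descriptor matrix
picked = staged_design(candidates, n_points=20)
random_rows = rng.choice(len(candidates), size=20, replace=False)
print("staged design G-efficiency:", g_efficiency(candidates[picked], candidates))
print("random design G-efficiency:", g_efficiency(candidates[random_rows], candidates))
```

When the design points are drawn from the candidate set, this G-efficiency lies between 0 and 1, so the two printed values can be compared in the same way the 82 point design was compared with random designs.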


Several  comments  are  in  order.  Orthogonal 
polynomials  "fold"  a  dimension.  For  example,  a  tripeptide 
that  has  a  large,  small,  large  R-group  in  the  three  positions 
will  be  intermediate  in  size  score  for  the  mean  polynomial 
and  hence  not  selected  as  a  vertex,  but  it  will  be  large  for  the 
quadratic  polynomial  and  will  be  selected  as  a  vertex.  The 
quadratic polynomial folded the size space moving a center
point to an extreme point. D-optimal design software
selects points that are vertices in a space. An obvious
strategy  is  to  select  extreme  points  in  the  various 
dimensions  as  starting  points  for  a  design.  We  are 
attempting  to  saturate  a  low  dimension  space  and  do  it  by 
creating  a  higher  dimension  space  that  has  the  right  vertices 
for  the  lower  dimension  space. 
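
The "folding" effect can be illustrated with a toy enumeration. The size levels 1, 2, 3 (small, medium, large) and the unnormalized contrasts used for the total and quadratic scores are assumptions made for illustration only.

```python
from itertools import product

sizes = (1, 2, 3)  # hypothetical small, medium, large side groups
triples = list(product(sizes, repeat=3))  # all 27 (R1, R2, R3) size patterns

# Total-size score and quadratic score (ends versus middle) for every pattern.
total = {t: t[0] + t[1] + t[2] for t in triples}
quad = {t: t[0] - 2 * t[1] + t[2] for t in triples}

large_small_large = (3, 1, 3)
print("total score:", total[large_small_large],
      "range:", min(total.values()), "to", max(total.values()))
print("quadratic score:", quad[large_small_large],
      "range:", min(quad.values()), "to", max(quad.values()))
print("pattern with the largest quadratic score:", max(quad, key=quad.get))
```

The large, small, large pattern sits in the interior of the total-size axis but at the end of the quadratic axis, so a vertex-seeking design algorithm can reach it only through the quadratic score.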

Software  for  Visualization 

The  experimental  design  software  produces  an 
analytical  solution  to  the  selection  of  representative 
tripeptides. Our resulting design had 82 points in a 34D
space. To evaluate this design we used various 3D rotating
scattergraph programs. This work was done on a Macintosh
and we used MacSpin, Data Desk, and JMP. All three
software  packages  were  effective,  although  each  had  different 
features  that  helped  in  the  visual  evaluation  of  the  design. 

Our evaluation proceeded as follows. First we selected
a random set of 800 points from the 8,000. This was
necessary as rotation speed was a function of data set size.
Next,  we  added  the  82  design  points  to  the  data  set  and 
marked  them  with  color  and/or  a  distinctive  symbol.  We 
then  proceeded  to  look  at  various  3D  projections  of  the 
random  and  design  points. 
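
The preparation of the data set for viewing can be sketched as below. The descriptor matrix and the design indices are random stand-ins, and the function names are hypothetical. One assumption differs slightly from the description above: the background sample is drawn from the non-design points so that no row appears twice.

```python
import numpy as np

def build_view_set(descriptors, design_idx, n_background=800, seed=0):
    """Combine a random background sample with the design points and flag the latter."""
    rng = np.random.default_rng(seed)
    design_idx = np.asarray(design_idx)
    others = np.setdiff1d(np.arange(len(descriptors)), design_idx)
    background = rng.choice(others, size=n_background, replace=False)
    rows = np.concatenate([background, design_idx])
    is_design = np.concatenate([
        np.zeros(len(background), dtype=bool),
        np.ones(len(design_idx), dtype=bool),
    ])
    return descriptors[rows], is_design

def project3(points, dims):
    """Keep three of the descriptor columns for one 3D scatter view."""
    return points[:, list(dims)]

rng = np.random.default_rng(2)
descriptors = rng.normal(size=(8000, 34))            # stand-in for the 34 descriptors
design_idx = rng.choice(8000, size=82, replace=False)  # stand-in for the 82 design points

points, is_design = build_view_set(descriptors, design_idx)
view = project3(points, (0, 1, 2))
print(view.shape, int(is_design.sum()), "design points flagged")
```

The boolean flag is what a plotting package would map to color or symbol.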

The following figures show three 90 degree views of
the first three dimensions of the data.

In MacSpin we could slice through the cloud of points
to examine the number and spacing of design points in
planes of the data cloud.

Note  that  there  are  about  6,000  ways  to  select  three 
dimensions  from  the  34.  Also  note  that  if  a  certain 
projection looks bad, i.e. design points are absent or poorly
spaced, then there is no easy way to fix the design.
Dropping one design point because it is visually close to
another in a certain subspace and adding a point to fill in a
void are likely to upset the design in other dimensions. The
visualization is reassuring, but it does not offer an easy way
to  fix  a  perceived  deficient  design. 
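
The number of coordinate projections quoted above can be verified directly, and the same tool enumerates the axis triples if one wants to step through them systematically. This is a small check, not part of the original procedure.

```python
import math
from itertools import combinations

n_dims = 34
n_views = math.comb(n_dims, 3)          # number of 3D coordinate projections
views = combinations(range(n_dims), 3)  # lazy iterator over (i, j, k) axis triples
first_view = next(views)
print(n_views, first_view)
```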


Discussion 

Visualization  helps  assure  the  statistician  that 
analytical  techniques  have  been  correctly  employed.  With 
many analytical techniques it can be difficult to detect if
gross mistakes are made. It was quite reassuring to the
statisticians that the points of the final design seemed to
saturate the 34D space. Making the 82 tripeptides cost
about 50k dollars and took considerable time. Chemists and
managers had to evaluate the reasonableness of the effort.
Visualization was very effective in showing non-statisticians
what was being proposed and some of the limitations, e.g. the
gaps between design points, of the procedure. The
collaborators in this project were chemists, and Medicinal
Chemists tend to think in highly visual ways. 3D rotating
scattergraphs were very appealing to them.

Most  of  this  work  was  done  some  time  ago.  In  the 
meantime  desktop  computers  have  become  much  more 
powerful.  Experimental  design  work  could  now  be  done  on 
workstations, particularly if overnight or weekends were
available. 

The  visualization  of  multiple  dimensions  is  still  a 
problem. With 3D rotation, color and symbols it is
possible to get some feel for 4-5D, but we were working in
34D and we wanted to have good assurance of the saturation
of 9D in our 34D space. After time consuming visual
examination, we became comfortable that we had done a
reasonable job, but it did take time and if we had found
deficiencies,  we  would  have  had  no  recourse  but  to  start  all 
over  again. 

Abstract

This  paper  investigates  approaches  to  the  design  of 
simulation  experiments  for  training  neural  networks  which 
are  to  be  used  as  classifiers.  Hierarchical  clustering  applied 
to the ART1 and ART2 (ART = Adaptive Resonance
Theory) neural network architectures developed by
Carpenter and Grossberg [20,21] is the basis for the
approach. A series of experiments based on this approach
will test the performance of ART1 and ART2 as pattern
classifiers against a variety of real and artificial data sets.
The issues to be investigated in these experiments include
the sensitivity of performance to a variety of network
parameters, pattern characteristics, and pattern presentation
disciplines. A background is provided for those unfamiliar
with neural networks in general, and with Grossberg's
approach in particular.

Some  Background  on  Neural  Networks 
Neural networks are the latest super-hyped glamor
technology, following hard on the heels of "artificial
intelligence," and many statisticians are no doubt wondering
how much substance, if any, lies behind the smoke. Many
of the claims are of course exaggerated, and many
techno-promoters are pushing the use of neural network algorithms
where (for example) a standard linear regression analysis is
sufficient to do the job. Nevertheless, neural networks
which can outperform traditional statistical, signal
processing and pattern recognition approaches already exist
and have proved their worth in a number of applications.
As with artificial intelligence, there are unresolved
theoretical and practical issues of "machine learning" which
on the one hand are in desperate need of statistical
assistance, and on the other hand stretch both theoretical
and applied statistics to their limits.

There  are  many  motivations  for  investigating  neural 
networks, and a large number of different approaches.
Understanding brain function was the initial motivation for
the study of neural networks, and remains the primary
motive for that aspect of the subject which Arbib [1] has
called "computational neuroscience." In what must be
called the pioneering paper of the subject, McCulloch and
Pitts [2] described how neurons with firing thresholds
function as logic gates, and how interconnected groups of
neurons could perform operations describable by a logical
calculus. In their next paper [3], these authors proposed a
totally different computational model of memory: an analog
spatial map developed as a consequence of the dynamics of
neural activity. While the analog approach has since
predominated in neuroscience, the two models are not
contradictory, but complementary, a point emphasized by
Von Neumann [4] and apparent in many current neural
network models. Neural analogies were also central in
Wiener's vision of a new discipline of cybernetics [5].

The  complementary  nature  of  logical  (digital)  and  dynamic 
(analog)  activity  provides  the  second  motive  for 
investigating  neural  networks,  that  of  designing  massively 
parallel  hybrid  computers  which  mimic  to  some  extent  the 
architecture  of  the  brain.  The  first  "neurocomputer"  was 
Rosenblatt's  "perceptron"  [6].  The  subsequent  flurry  of 
excitement  led  to  a  brief  period  of  heavy  government 
funding of "brain machines" in the 1960's in an atmosphere
of techno-hype that makes the current round pale by
comparison. Hardware limitations, early technical failures,
and the devastating impact of Minsky and Papert's [7]
analysis led to a 15-year long "dark age" in which neural
networks were eclipsed by "artificial intelligence."

The  current  revival  began  in  1982  with  the  introduction  of 
the  Hopfield  network  [8,9].  Since  then,  many 
neurocomputing  algorithms  have  been  proposed  or  revived, 
the  most  popular  being  the  Boltzmann  machine  [10],  which 
owes a great deal to the work of statisticians Geman and
Geman [11], and above all, back-propagation [12,13], which
has become almost synonymous in many people's minds
with neural networks. Ease of implementation and some
impressive, well-crafted applications of "backprop" have
unfortunately overshadowed its limitations, and the
importance of the steady, ongoing developments which
continued in spite of the "dark age."

Another  impetus  for  the  current  revival  of  neural  networks 
is  the  availability  of  new  technologies  for  hardware 
implementation.  Large-scale  integrated  circuits  operating  in 
either  the  familiar  digital  mode  or  in  analog  (subthreshold) 
mode  based  on  neural  architectures  are  already  being 
produced,  and  optical  processors  are  in  the  design  stage. 

Statistical  theory  is  already  playing  a  role  in  understanding 
learning  performance  [14,15].  In  a  closely  related 
development,  statistical  decision  theory  is  the  latest  "hot 
topic"  in  the  field  of  machine  learning,  not  only  for  neural 
networks but also for "conventional" artificial intelligence
[16]. 

Learning  Properties  of  Neural  Networks 

Unlike  a  digital  computer,  which  performs  its  computations 
by  following  a  series  of  programmed  instructions,  a  neural 
network  is  trained  by  presenting  it  with  a  series  of  examples 
of  the  inputs  for  which  outputs  are  desired.  If  desired 
outputs  are  presented  simultaneously  with  inputs,  and  the 
neural net adjusts its connection weights so that its output
approximates the desired output, the training is called
supervised; otherwise it is unsupervised. Once the training
period is over, the neural network is presented with new
inputs for recognition. The parallel with statistical
estimation and prediction is immediately apparent. A closer
examination reveals that supervised learning resembles
nonparametric regression analysis if the output is
continuous, and nonparametric discriminant analysis if it is
discrete, while unsupervised learning corresponds to either
cluster analysis or nonparametric density estimation.

A  major  difference  between  training  a  neural  network  and 
applying a statistical algorithm is that the input examples
are  presented  to  the  network  one  at  a  time.  In  this  respect  a 
neural  network  resembles  a  recursive  statistical  algorithm 
such  as  the  Kalman  filter.  But  what  many  neural  networks 
learn  depends  on  the  order  in  which  the  patterns  are 
presented,  which  is  typically  not  the  case  for  a  statistical 
algorithm.  It  is  common  practice  to  cycle  through  the 
training  set  repeatedly  until  the  network  weights  cease  to 
change,  or  some  other  criterion  for  stability  is  satisfied.  In 
some  cases  the  order  of  presentation  is  varied  randomly  or 
systematically from cycle to cycle, in others it is not.

Despite these differences, it is clear that statistical methods
will be useful in evaluating neural network performance.
The usual practice at present is to set aside a portion of the
training set for testing, and evaluate the performance of the
trained network on the testing set. Testing corresponds to
the statistical practice of cross-validation, and the issues
surrounding the tradeoff between bootstrapping and
cross-validation are especially complex in this context.
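
The two ways of forming training and testing sets can be contrasted in a short sketch. The function names, the 25 percent test fraction, and the use of out-of-bag cases as the bootstrap testing set are illustrative assumptions, not choices made in this paper.

```python
import random

def holdout_split(n, test_fraction=0.25, seed=0):
    """Set aside a fixed portion of the n cases for testing (no case is reused)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def bootstrap_split(n, seed=0):
    """Resample n cases with replacement; cases never drawn form the testing set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test

train, test = holdout_split(20)
boot_train, boot_test = bootstrap_split(20)
print(len(train), len(test), len(boot_train), len(boot_test))
```

The bootstrap training set repeats some cases, which matters for a network whose learning depends on how often and in what order each pattern is presented.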

Grossberg's  Neural  Principles  and  Adaptive 
Resonance 

While  many  authors  have  modeled  specific  aspects  of  brain 
function,  the  most  comprehensive  theory  and  the  broadest 
collection  of  models  of  cognitive  activity  has  been  produced 
by Stephen Grossberg and his collaborators [17,18,19].
Their approach is to search for and apply general principles
underlying a wide range of experimental evidence from
neurophysiology  and  psychophysics. 

Grossberg  begins  with  a  physically-based  model  of  neural 
activity,  leading  to  a  system  of  non-linear  differential 
equations  for  synapse  activities  and  connection  weights. 
The  time  constants  of  the  connection  weights,  which  store 
long-term  memory,  are  much  longer  than  those  of  the 
synapses  which  constitute  short-term  memory.  Unlike 
back-propagation  and  other  feedforward  algorithms  which 
do not always converge, global dynamic stability is built in
to the structure of these equations. The details of the
differential equations are highly flexible, allowing for a
wide variety of architectures capable of representing
important aspects of vision, speech, memory, conditioned
and unconditioned responses, and even reasoning. For a
more detailed overview, see the references above, especially
Chapters 1 and 13 of [18]. (Chapters 1 and 12 of [18] also
appear in Anderson & Rosenfeld's collection of "classic"
papers [AR] as Chapters 24 and 19 respectively.)

Adaptive  Resonance  Theory  (ART)  refers  to  the  neural 
feedback mechanisms which have been developed to ensure
stable encoding of incoming stimuli within the framework
of this broader theory. ART1 and ART2 are general-purpose
neural network modules based on ART principles.
Functionally, they provide a means for rapid, unsupervised
learning and classification of incoming patterns (represented
as extremely high-dimensional vectors) based on "reset" of
poor matches with generalizations of previously learned
patterns (templates), and "resonance" with good ones.
ART1 is designed for binary ("black & white") patterns,
while ART2 operates on continuous-valued ("grey-scale")
patterns. Both ART1 and ART2 depend on a single
parameter called the "vigilance level" which determines the
fineness of the resulting classification. Higher vigilance
results in a larger number of classes. Details will be found
in Carpenter and Grossberg [20,21]. While most of the
computing tasks performed to date by these "ART units"
involve pattern learning & recognition, this limitation is not
inherent.

As with many other neurocomputing algorithms, the natural
computer implementation of ART is on an integrated silicon
chip, optical circuit or other physical device. But until such
devices are more readily available, simulations of neural
algorithms will be carried out on digital computers. Unlike
some other neurocomputing algorithms, "fast learning"
special cases of the ART1 and ART2 algorithms are easily
coded,  and  many  experiments  and  practical  applications  of 
ART  can  be  implemented  in  a  digital  computing 
environment. 

Order  Dependence  of  Neural  Network 
Learning: A Remedy for ART1

The fast learning versions of ART1 and ART2 may be
regarded  as  clustering  algorithms  for  the  purpose  of 
understanding  training.  In  some  contexts  the  order  of 
pattern  presentation  may  be  meaningful,  and  it  should  be 
allowed  to  affect  the  resulting  classification.  But  for  many 
technological  applications,  the  presentation  order  of  the 
training  patterns  is  irrelevant,  and  the  dependence  of  the 
classification  of  training  patterns  and  templates  on  input 
order  is  undesirable.  Furthermore  it  is  a  sign  that  the 
resultant  classification  is  statistically  inconsistent. 
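
Order dependence is easy to reproduce with a simplified fast-learning clusterer in the style of ART1. The sketch below is an assumption-laden illustration, not the Carpenter and Grossberg algorithm in full: it makes a single pass, uses the choice function |I and T| / (beta + |T|), the vigilance test |I and T| >= rho |I|, and the template update T <- I and T, and it assumes every pattern has at least one active bit. The function name and the three toy patterns are hypothetical.

```python
import numpy as np

def art1_like_pass(patterns, vigilance, beta=0.5):
    """One presentation pass of a simplified fast-learning ART1-style clusterer."""
    templates, labels = [], []
    for pattern in patterns:
        p = np.asarray(pattern, dtype=bool)
        overlap = [int(np.sum(p & t)) for t in templates]
        # Search nodes by decreasing choice value; sorted() is stable,
        # so ties are resolved in favor of the older node.
        search = sorted(
            range(len(templates)),
            key=lambda j: -overlap[j] / (beta + templates[j].sum()),
        )
        for j in search:
            if overlap[j] >= vigilance * p.sum():  # vigilance test passed: resonance
                templates[j] = p & templates[j]    # fast learning shrinks the template
                labels.append(j)
                break
        else:  # every existing node was reset: commit a new node
            templates.append(p.copy())
            labels.append(len(templates) - 1)
    return labels, templates

A = [1, 1, 1, 0]
B = [0, 1, 1, 1]
C = [0, 1, 1, 0]

labels_abc, _ = art1_like_pass([A, B, C], vigilance=0.7)
labels_cab, _ = art1_like_pass([C, A, B], vigilance=0.7)
print("order A, B, C:", len(set(labels_abc)), "classes")
print("order C, A, B:", len(set(labels_cab)), "classes")
```

With these three patterns and vigilance 0.7, presenting A, B, C yields two classes while presenting C, A, B yields three, so the classification depends on nothing but the input order.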

Under  these  conditions  it  would  be  desirable  to  find  a  way 
to  train  the  network  to  find  a  classification  of  the  training 
patterns which is free from this problem. For ART1 it is
possible to do so, and it appears likely that it will also be
possible for ART2. The resulting classification may be
described as a "canonical" one (it is not clear that it is
unique). If the nodes of the ART1 unit are encoded with the
templates  corresponding  to  this  canonical  classification, 
each  training  pattern  will  "resonate"  automatically  with  the 
correct  node,  regardless  of  the  order  of  presentation. 

This result depends on the fact that the ART1 and ART2
algorithms may be characterized by similarity measures of
the type used in cluster analysis. Furthermore, as a result of
a 1978 theorem of Grossberg (Chapter 12 of [18] or Chapter
19 of [AR]), ART2 will correctly identify pattern
classifications which are sufficiently separated from one
another. The author of this paper has shown a similar result
for ART1 [22], although the similarity measure proposed in
that paper must be slightly modified. By putting these two
facts together, it is possible to show that a single-linkage
hierarchical cluster analysis of the training patterns based on
the similarity measure will produce a nested family of
canonical classifications corresponding to a family of ART1
units with vigilance levels ranging from zero to one.
(Vigilance zero places all patterns in one class, while a
value of one creates a separate class for each pattern). More
specifically,  the  open  interval  (0,1)  is  broken  into  open 
subintervals:  each  subinterval  is  a  range  of  vigilance  level 
values  which  will  give  a  specific  classification.  Relating 
learning  to  single-linkage  cluster  analysis  may  prove  critical 
in  establishing  the  statistical  consistency  of  the  learning 
process;  see  Hartigan  [23].  The  mathematical 
demonstration  of  these  results  will  be  presented  in  a  future 
publication. 
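
The nested family of classifications can be illustrated with a threshold form of single-linkage clustering: two patterns fall in the same class whenever a chain of pairwise similarities at or above the threshold connects them. The Jaccard similarity used here is a stand-in chosen for illustration; it is not the similarity measure of [22], and the patterns (given as sets of active feature indices) are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Illustrative similarity between two binary patterns given as index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def single_linkage_classes(patterns, threshold, similarity=jaccard):
    """Label patterns by connected components of the graph similarity >= threshold."""
    parent = list(range(len(patterns)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(patterns)), 2):
        if similarity(patterns[i], patterns[j]) >= threshold:
            parent[find(i)] = find(j)

    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(patterns))]

patterns = [{0, 1, 2}, {1, 2, 3}, {1, 2}, {7, 8}]
for threshold in (0.0, 0.6, 1.0):
    labels = single_linkage_classes(patterns, threshold)
    print(threshold, labels)
```

Because the merging ignores presentation order, the result is the same for any ordering of the patterns, and the classes obtained at a higher threshold always refine those obtained at a lower one.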

Thus  the  order-dependence  of  the  learned  result  of  training 
has  been  eliminated  by  a  statistical  algorithm  which 
"simulates"  the  neural  network,  handling  the  training 
elements  simultaneously  instead  of  one  at  a  time.  It  may  be 
feasible  to  characterize  other  neural  network  classification 
algorithms  by  a  similarity  measure,  and  use  the  same 
approach to achieve order-independent training of the
network. 

Simulation  Experiments  with  Neural  Network 
Training 

In  many  applications,  an  important  part  of  the  simulation  of 
ART1 learning and recognition performance will involve
the use of hierarchical clustering to remove the effect of
training order on what is learned. This presupposes, of
course, that the investigator has no predetermined
classification in mind. If he does, and this classification
agrees with a canonical one, his problem is solved (and
ART1 is an extremely appropriate architecture for his
problem!). When this is not the case, a certain amount of
classification error may be tolerated (as it generally is in
discriminant analysis).

If the training elements are very high-dimensional, very
complex, or very numerous, this approach may not be
computationally feasible. Even for applications where it is
practical, many other statistical issues must be resolved.
The need to test the performance of the network raises the
issue of "bootstrapping vs. cross-validation" noted earlier.
For example, how does one compare an assortment of
hierarchical clusterings arising from bootstrapped or
cross-validated training/testing samples? The answer will depend
in part on whether or not memory (templates in the case of
ART1) will be "frozen" in the implementation.

ABSTRACT

We propose to classify points in $\mathbb{R}^d$ by functions
related to two-layer (a single hidden layer) feedforward
artificial neural nets (ANNs). These functions, dubbed
dynamic ANNs (DANNs), arise in a rather natural
way from probabilistic and also statistical considerations.
We treat the binary classification problem and
outline an approach to the n-ary classification problem.
There are two key ideas. The probabilistic idea is that
DANNs are conditional probabilities in certain mixture
models. The statistical idea is that these models, and
hence the DANNs defined by them, are conveniently
trainable by an expectation-maximization (EM) algorithm.

INTRODUCTION

Consider classification of points $x \in \mathbb{R}^d$ (feature
vectors) by using continuous functions of $x$ for the
probabilities of classes. For binary classification this
means any continuous function $\mathbb{R}^d \to [0,1]$ and for
n-ary classification any continuous function from $\mathbb{R}^d$
to the $(n-1)$-dimensional simplex in $[0,1]^n$. In this paper
we focus on binary classification and merely sketch
the generalization to n-ary classification.

Cybenko (1989) has shown that any continuous
function defined on a compact set in $\mathbb{R}^d$ can be uniformly
approximated by a two-layer (= one hidden
layer) artificial neural net (ANN) $f_c$, i.e. a function
of the form

(1)  $f_c(x) = \sum_{j=1}^{k} \alpha_j \, \sigma\left( \sum_{i=1}^{d} w_{ij} x_i \right)$

where $k$ is a sufficiently large integer, $\sigma: \mathbb{R} \to [0,1]$ is
a sigmoid and the $\alpha_j$, $w_{ij}$ are constants. Barron (1991)
has given a bound on the number $k$ of summands
required for approximation within prescribed precision.

A different class of approximating functions, which
we call dynamic artificial neural nets (DANNs) will be
introduced. Unlike (1), these functions are based on a
probabilistic description of the classification process
and hence enjoy certain properties which we shall exploit.
A DANN $f: \mathbb{R}^d \to [0,1]$ is defined as a conditional
expectation

(2)  $f(x) = E[\, U(Y) \mid X = x \,]$

where $U$ is the unit step function

(3)  $U(y) = 1$ if $y \ge 0$, and $U(y) = 0$ otherwise,

and $Y$ is jointly distributed with $X$. If $Y$ were a linear
function of $X$ then $U(Y)$ would be a perceptron, in fact a
'threshold logic unit'. We shall construct $Y$ by adding
noise to a randomly chosen linear function of $X$. We
cannot observe the identity of such a linear function,
hence the 'hidden perceptron' of the title. This construction
produces a joint distribution $P$ for $(X,Y)$ in
which

(4)  $f(x) = \sum_{j=1}^{k} \pi_j(x) \, \sigma\left( \sum_{i=1}^{d} w_{ij} x_i \right)$.

The form in (4) differs from $f_c$ in (1) only in that the
real constants $\alpha_j$ are replaced by certain hidden state
probability functions $\pi_j(x)$. We shall argue elsewhere
that, as is the case for the ANN functions $f_c$, the class
of DANN functions $f$ can uniformly approximate on
a compact set any function which is continuous there.
It will become clear that DANNs are not ANNs except
in the degenerate case where the $\pi_j$ are constant in
$x$; this corresponds to a certain statistical independence
in our model. Conversely an ANN with one or more
negative $\alpha_j$ cannot be a DANN so neither class contains
the other.



The motivation for this approach to classification
is both theoretical and practical. On the one hand we
wish to use statistical optimization criteria, such as ML
or Bayes, for training the classifier, and at the same
time we wish to accomplish the training with the simplest
numerical algorithm. For this reason we complete
the choice of the form of the joint distribution of
$(X,Y)$ so as to also allow the construction of an EM
algorithm (Dempster et al. (1977), Meilijson (1989)) for
estimating its parameters. The EM algorithm for
learning the distribution $P$ thus becomes an indirect
but simple training algorithm for DANNs $f(x)$. By
way of contrast: the standard current approaches
using ANNs consist of some curve fitting ("back
propagation" etc.) methods. These typically vary the
constants defining $f$ so as to minimize the sum of
squared errors in predicting the unit sum bitstring $z$
which encodes (see the discussion below) the index of
the class corresponding to $x$ by its expectation $f(x)$.
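
The difference between the ANN form (1) and the DANN form (4) of the introduction can be made concrete in a small sketch. Several ingredients are assumptions made only for illustration: the sigmoid is taken to be logistic, and the hidden state probabilities pi_j(x) are computed as posterior probabilities of unit-variance Gaussian components with hypothetical centers and prior weights. The excerpt above does not specify these forms.

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid (an assumed choice of sigma)."""
    return 1.0 / (1.0 + np.exp(-t))

def ann(x, alpha, w):
    """Form (1): sum_j alpha_j * sigma(w_j . x) with constant coefficients alpha_j."""
    return float(alpha @ sigmoid(w @ x))

def state_probabilities(x, prior, centers):
    """Hypothetical pi_j(x): posterior of state j under unit-variance Gaussian components."""
    log_post = np.log(prior) - 0.5 * np.sum((centers - x) ** 2, axis=1)
    log_post -= log_post.max()  # stabilize the exponentials
    post = np.exp(log_post)
    return post / post.sum()

def dann(x, prior, centers, w):
    """Form (4): the constants alpha_j are replaced by x-dependent probabilities pi_j(x)."""
    return float(state_probabilities(x, prior, centers) @ sigmoid(w @ x))

rng = np.random.default_rng(3)
k, d = 3, 2
w = rng.normal(size=(k, d))        # one weight vector per hidden unit
centers = rng.normal(size=(k, d))  # hypothetical component centers
prior = np.array([0.2, 0.3, 0.5])  # hypothetical mixing weights
x = rng.normal(size=d)

pi = state_probabilities(x, prior, centers)
print("pi_j(x):", pi, "DANN output:", dann(x, prior, centers, w))
print("ANN output with alpha = prior:", ann(x, prior, w))
```

Because the pi_j(x) are nonnegative and sum to one, the DANN output is a convex combination of sigmoid values and therefore always lies in [0, 1], which an ANN with arbitrary real coefficients need not do.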

THE CLASSIFICATION PROBLEM

A probabilistic version of the classification problem
is this: given a completely specified joint distribution
$P$ of the random pair $(X,I)$ with $X \in \mathbb{R}^d$ and
$I \in \{0, \ldots, n-1\}$, find a classifier function
$\Psi: \mathbb{R}^d \to \{1, \ldots, n\}$ which minimizes the probability of
misclassification $P(\Psi(X) \ne I)$. The well known solution
is to let $\Psi(x)$ be the least (say) nonnegative integer
which achieves $\max_{1 \le i \le n} P(I = i \mid X = x)$.
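
The rule just stated amounts to an argmax with ties broken toward the smallest class index. A minimal sketch, assuming the class probabilities at a given x are already available as a list and that classes are numbered from 1 (the function name is hypothetical):

```python
def bayes_class(posteriors):
    """posteriors[i - 1] holds P(I = i | X = x) for classes i = 1, ..., n."""
    best = max(posteriors)
    # list.index returns the first position of the maximum, i.e. the least maximizer.
    return posteriors.index(best) + 1

tied = bayes_class([0.2, 0.4, 0.4])
print(tied)
```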

In  our  approach  to  classification  a  coded  version  Z 
of  the  class  index  /  is  introduced.  Corresponding  to  a 
random  feature  vector  X  let  f.Z  memberof  Ibrace  0, 1 
rbrace  sup  m  be  a  onc-to  one  encoding  of  the  class 
index  /  as  a  bitstring.  Tor  the  sake  of  concreteness  the 
reader  may  wish  to  regard  Z  as  the  binary  expansion 
of  the  class  index  /  and  in  this  ease  m  is  the  least  in¬ 
teger  m>=  log2n.  The  more  popular  encoding  consists 
of  encoding  the  event  /  =  /  as  the  hitstring  associated 
with  the  i  —  th  vertex  of  the  n  dimensional  simplex;  in 
this  ease  m  =  n.  It  is  obvious  that  from  the  probabi¬ 
listic  point  of  view  it  does  not  matter  which  one-to-one 
encoding  one  chooses.  Contrast  this  w  ith  the  statistical 
point  of  view,  i.e.  the  typical  practical  situation  wherein 
the  joint  distribution  of  (T,/)  is  not  specified  com¬ 
pletely;  in  this  case  one  has  some  training  data 
(5)  7- ={(x, ./,)(/=  1 . A} 

to work with instead. We assume that $\mathcal{T}$ is a random
sample from the distribution of $(X, I)$. The usual statistical
approach, which (for the lack of a better idea) we
also adopt, is to estimate the joint distribution P by a
distribution $\hat{P}$ and thereafter ignore the error of the
estimate. Since some functions are easier to estimate
than others, it is no longer clear that different encodings
are equally good. We do not pursue the encoding
issue but simply assume that some encoding is
specified. It is likely that ultimately some problem-dependent
encoding will be preferred to either of the
two simple encodings mentioned above: an encoding
chosen to optimize the performance of the trained
classifier.
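
The two simple encodings mentioned above can be sketched as follows. This is only an illustration: the function names are hypothetical, and class indices are assumed to run over 0, ..., n-1 so that the binary expansion is well defined.

```python
def binary_code(i, n):
    """Binary expansion of class index i (0 <= i < n) as a list of m bits.

    m is the least integer with m >= log2(n), with a minimum of one bit.
    """
    m = max(1, (n - 1).bit_length())
    return [(i >> b) & 1 for b in reversed(range(m))]


def simplex_code(i, n):
    """Code of class i as the i-th vertex of the simplex, so that m = n."""
    return [1 if j == i else 0 for j in range(n)]
```

Both maps are one-to-one, so they carry the same probabilistic information; they differ only in the number m of bits that a trained classifier has to estimate.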

BINARY PERCEPTRON AND ITS DANN

We now fix m = 1 so Z is just a random bit. A joint
distribution for the input-output pair
$(X, I) \in \mathbb{R}^d \times \{1, \ldots, n\}$ implies a joint distribution for
the coded version $(X, Z)$. The probability element for
$(X, Z)$ has the form

(6)  $g_X(x)\, p(z \mid x)$

where $g_X$ is a density on $\mathbb{R}^d$ and, for fixed x, $p(z \mid x)$ is a
probability on {0,1}. In the terminology of the EM algorithm,
a sample from the distribution of the observable
pair $(X, Z)$ is the 'incomplete data' and $g_X(x)\, p(z \mid x)$
is the incomplete data model. When this is
parametrized as $g_X(x \mid \theta)\, p(z \mid x, \theta)$ then the corresponding
incomplete data likelihood function is

(7)  $\prod_{t=1}^{T} g_X(x_t \mid \theta)\, p(z_t \mid x_t, \theta)$.

A direct numerical approach attempts to maximize (7).
Instead, the EM algorithm iteratively maximizes a different
function which however has a maximum at the
same $\theta$.

In order to model the generation of the data and to
construct an EM algorithm, we now introduce the
complete data model. The idea is to make local models
of the joint distribution of the feature vector X and a
noisy locally linear function Y whose only purpose is
to define the classifying bit Z. Locality is achieved
through the use of a mixing variable $J \in \{1, \ldots, k\}$ with
$P(J = j) = \alpha_j$. Let

(8)  $(X, Y, J)$,  $X \in \mathbb{R}^d$,  $Y \in \mathbb{R}^1$,  $J \in \{1, \ldots, k\}$

denote the complete data. Conditionally on $J = j$ the
density of $(X, Y)$ is $d+1$ dimensional Gaussian with
mean vector


(9) 


and  covariance  matrix 


(10) 


Observe that X has a Gaussian mixture distribution
describing the feature space and Y is the noisy signed
distance to a hyperplane determined by the coefficients
of the random linear function $E(Y \mid X, J)$. (Actually the
Gaussian assumption is not necessary; any tractable
d-dimensional kernel will do here. The conditional
distribution of Y given both $X = x$ and $J = j$ can also
be replaced by any tractable non-Gaussian distribution
but the latter must have a location parameter which is
linear in x. The $d+1$ dimensional Gaussian assumption
automatically satisfies this condition.)


hence to estimate the complete data log likelihood. The
latter is then (trivially) maximized to get $\theta^{r}$ in the r-th
iteration.


Without loss of generality we can parametrize the
rest of $\Gamma_j$ as follows:

(11)  $\Gamma_{12,j} = \Gamma_{21,j}' = \Gamma_{11,j}\,\beta_j$

where $\beta_j$ are the regression coefficients of the regression
of Y on X given $J = j$, and the variance of Y given
$J = j$ is

(12)  $\Gamma_{22,j} = \gamma_j^2 + \beta_j'\,\Gamma_{11,j}\,\beta_j$

where $\gamma_j^2$ is the conditional variance of Y given not only
$J = j$ but also $X = x$ (residual variance). We now put

(13)  $Z = U(Y)$

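where U is the unit step. Then

$E(Z \mid X = x) = \sum_{j=1}^{k} P(J = j \mid X = x)\, P(Z = 1 \mid X = x, J = j)$

(14)  $= \sum_{j=1}^{k} \pi_j(x)\, P(Y > 0 \mid X = x, J = j)$

$= \sum_{j=1}^{k} \pi_j(x)\, \Phi\big( (\nu_j + \beta_j'(x - \mu_j)) / \gamma_j \big)$

where $\pi_j(x) = \alpha_j g_j(x) / g_X(x) = P(J = j \mid X = x)$ and $\Phi$ is the
standard normal integral. Setting $w_{ij} = \beta_{ij} / \gamma_j$ and

$w_{0j} = \big( \nu_j - \sum_{i=1}^{d} \beta_{ij}\, \mu_{ij} \big) / \gamma_j$,

and choosing the sigmoid to be standard Gaussian,

(15)  $\sigma(y) = \Phi(y) = \int_{-\infty}^{y} \phi(u)\, du$

with $\phi(u) = (2\pi)^{-1/2} e^{-u^2/2}$, we see that the DANN in (4)
is precisely the conditional expectation

(16)  $f(x) = E(Z \mid X = x) = P(Z = 1 \mid X = x)$.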
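
As a rough numerical sketch of (14)-(16), the conditional expectation f(x) can be evaluated as a posterior-weighted sum of Gaussian sigmoids. The function and argument names below are hypothetical, and the feature covariance of each mixture component is assumed to be diagonal purely to keep the example short; the model in the text allows a general covariance.

```python
import math


def std_normal_cdf(u):
    """Standard normal integral Phi(u)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def diag_gauss_pdf(x, mean, var):
    """Gaussian density with diagonal covariance (a simplifying assumption)."""
    logp = 0.0
    for xi, mi, vi in zip(x, mean, var):
        logp += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (xi - mi) ** 2 / vi
    return math.exp(logp)


def dann_output(x, alpha, mu, var, w0, w):
    """f(x) = sum_j pi_j(x) * Phi(w0[j] + w[j] . x)."""
    # Unnormalized posterior weights alpha_j * g_j(x).
    dens = [a * diag_gauss_pdf(x, m, v) for a, m, v in zip(alpha, mu, var)]
    total = sum(dens)
    f = 0.0
    for j, d in enumerate(dens):
        activation = w0[j] + sum(wi * xi for wi, xi in zip(w[j], x))
        f += (d / total) * std_normal_cdf(activation)
    return f
```

Because the weights pi_j(x) are nonnegative and sum to one, the output always lies between 0 and 1, as a probability P(Z = 1 | X = x) must.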


THE TRAINING ALGORITHM


THE E-STEP OF TRAINING


The E-STEP in the usual Gaussian mixture problem
estimates all the unobservable complete data sufficient
statistics. These are unobservable because J, the
mixture index, is hidden. Our problem is similar but
differs from this in that in addition to the unavailability
of J, the r.v. Y is also hidden except for its sign. Thus
the conditional expectations required here are based on
less information than in the usual mixture problem.
Let $\delta(A)$ be 1 or zero as A occurs or not. In our setup
the complete data sufficient statistics are
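(18)

$N_j = \sum_{t=1}^{T} \delta(J_t = j)$

$SX_j = \sum_{t=1}^{T} \delta(J_t = j)\, X_t$

$SXX'_j = \sum_{t=1}^{T} \delta(J_t = j)\, X_t X_t'$

$SYX'_j = \sum_{t=1}^{T} \delta(J_t = j)\, Y_t X_t'$

$SYY_j = \sum_{t=1}^{T} \delta(J_t = j)\, Y_t^2$

$SY_j = \sum_{t=1}^{T} \delta(J_t = j)\, Y_t$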

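The corresponding conditional expectations are

(19)

$\hat{N}_j = \sum_{t=1}^{T} p_j(X_t, Z_t)$

$\widehat{SX}_j = \sum_{t=1}^{T} p_j(X_t, Z_t)\, X_t$

$\widehat{SXX'}_j = \sum_{t=1}^{T} p_j(X_t, Z_t)\, X_t X_t'$

$\widehat{SYX'}_j = \sum_{t=1}^{T} p_j(X_t, Z_t)\, \hat{Y}_t(j)\, X_t'$

$\widehat{SYY}_j = \sum_{t=1}^{T} p_j(X_t, Z_t)\, \widehat{Y^2}_t(j)$

$\widehat{SY}_j = \sum_{t=1}^{T} p_j(X_t, Z_t)\, \hat{Y}_t(j)$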

Let $\theta$ denote a vector whose components form a list
of all the unknown parameters of the distribution

(17)  $\theta = \{(\alpha_j, \mu_j, \nu_j, \Gamma_j) \mid j = 1, \ldots, k\}$.

After initializing $\theta = \theta^0$ the EM algorithm for exponential
family (our setup) iterates two steps, (E): get
the conditional expected values of all complete data
sufficient statistics given the incomplete data and (M):
use these to estimate their unobservable versions and


where

$p_j(x_t, z_t) = P(J_t = j \mid X_t = x_t, Z_t = z_t;\, \theta)$,

(20)  F,(/)  =  /';(<y./)}'|T,Z;<7), 

^,{j)^iVj(r)Y\i=j,x^x-,0) 

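with $\theta = \theta^{r-1}$. It is easily checked that $p_j(x, 1)$ is given
by

$\dfrac{\pi_j(x)\, \Phi\big( w_{0j} + \sum_{l=1}^{d} w_{lj} x_l \big)}{\sum_{i=1}^{k} \pi_i(x)\, \Phi\big( w_{0i} + \sum_{l=1}^{d} w_{li} x_l \big)}$

and where, similarly, $p_j(x, 0)$ is given by

$\dfrac{\pi_j(x)\, \Phi\big( -w_{0j} - \sum_{l=1}^{d} w_{lj} x_l \big)}{\sum_{i=1}^{k} \pi_i(x)\, \Phi\big( -w_{0i} - \sum_{l=1}^{d} w_{li} x_l \big)}$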
We still need to define $\hat{Y}(j)$, $\widehat{Y^2}(j)$, and $\widehat{XY}(j)$. Since
$E(XY \mid X, Z) = X\, E(Y \mid X, Z)$, we need only the first two.
We have for r = 1, 2:

$E(\delta_i(J)\, Y^r \mid X = x, Z = z)$

$= p_i(x, z)\, E(Y^r \mid X = x, Z = z, J = i)$.

The last expectation may be evaluated as follows.
Writing

(24)  $\xi_i = \xi_i(x) = \nu_i + \sum_{j=1}^{d} \beta_{ji}\, (x_j - \mu_{ji})$

we have

$E(Y^r \mid X = x,\; Y > (<)\; 0,\; J = i)$

(25)  $= E\big( (\xi_i + \gamma_i Y_0)^r \mid Y_0 > (<)\; -\xi_i / \gamma_i \big)$

where $Y_0$ is a standard scalar Gaussian r.v. with mean
zero and variance one; its two conditional moments are
not hard to obtain in closed form and we omit them.
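
The closed forms omitted above are the standard truncated-normal moments. The sketch below is not taken from the paper: it uses the textbook identities $E(Y_0 \mid Y_0 > c) = \phi(c)/(1 - \Phi(c))$, $E(Y_0 \mid Y_0 < c) = -\phi(c)/\Phi(c)$ and $E(Y_0^2 \mid \cdot) = 1 + c\,E(Y_0 \mid \cdot)$ with $c = -\xi/\gamma$, and the function name is hypothetical.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def Phi(u):
    """Standard normal integral."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def truncated_moments(xi, gamma, z):
    """E(Y | sign) and E(Y^2 | sign) for Y ~ N(xi, gamma^2).

    z = 1 conditions on Y > 0, z = 0 conditions on Y < 0.
    """
    c = -xi / gamma                      # threshold for the standardized Y0
    if z == 1:
        m1 = phi(c) / (1.0 - Phi(c))     # E(Y0 | Y0 > c)
    else:
        m1 = -phi(c) / Phi(c)            # E(Y0 | Y0 < c)
    m2 = 1.0 + c * m1                    # E(Y0^2 | same event)
    ey = xi + gamma * m1
    ey2 = xi * xi + 2.0 * xi * gamma * m1 + gamma * gamma * m2
    return ey, ey2
```

For thresholds far in the tail the ratios above lose accuracy in floating point, so a production version would want a more careful evaluation of the tail probabilities.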


THE M-STEP OF TRAINING


and 'N'. We tested the model on an independent set
of  test  data  from  the  same  phones.  Our  best  results 
were  obtained  with  .*>  hidden  states  yielding  an  error 
rate  of  7.5  percent  on  the  test  data  and  .10  percent  on 
the training data. This is virtually identical with results,
on the same data, that had been obtained by training
and testing comparable neural nets with the usual fixed
weights.

THE n CLASS PROBLEM

Suppose that the bitstring encoding $Z \in \{0,1\}^m$ is
given and $(X, Z)$ has some joint distribution. Define
the complete data model by

(26)  $(X, Y, J)$,  $X \in \mathbb{R}^d$,  $Y \in \mathbb{R}^m$,  $J \in \{1, \ldots, k\}$

and set $Z_i = U(Y_i)$, $i = 1, \ldots, m$. In this case, conditionally
on $J = j$ the density of $(X, Y)$ is chosen to be
$d+m$ dimensional Gaussian. For convenience in
computing $P(I = i \mid X = x) = P(Z = z \mid X = x)$ we take
the conditional covariance matrix of Y given both
$X = x$ and $J = j$ to be diagonal. Then $P(Z = z \mid X = x)$
is given by

(27)  $\sum_{j=1}^{k} \pi_j(x) \prod_{i=1}^{m} p_{ij}(x)^{z_i}\, (1 - p_{ij}(x))^{1 - z_i}$

where

(28)  $p_{ij}(x) = P(Y_i > 0 \mid X = x, J = j)$.

We shall argue elsewhere that, as in the case of m = 1,
the Bayes classifier based on the true joint distribution
of $(X, Z)$ can be uniformly approximated with such
forms. The EM algorithm is again applicable; the only
new object is the conditional covariance between
components of Y given both X and J. While this is zero
by construction, enforcing this constraint in the M-step
requires some care.
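
A minimal sketch of (27): given the posterior weights $\pi_j(x)$ and the per-bit probabilities $p_{ij}(x)$ of (28), both already evaluated at a fixed x, the probability of a bitstring z is a mixture of products over bits. The function name and argument layout are assumptions made for the illustration.

```python
def bitstring_prob(z, pi, p):
    """P(Z = z | X = x) as in (27).

    z  : list of m bits
    pi : list of k posterior weights pi_j(x), summing to one
    p  : p[j][i] = P(Y_i > 0 | X = x, J = j) for component j and bit i
    """
    total = 0.0
    for weight, probs in zip(pi, p):
        term = weight
        for zi, q in zip(z, probs):
            # Bits are conditionally independent given X and J
            # because the conditional covariance of Y is diagonal.
            term *= q if zi == 1 else (1.0 - q)
        total += term
    return total
```

Classifying x then amounts to evaluating this expression at the codeword of each class and choosing the class with the largest value.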


Assemble the results of the E-STEP to form estimates
of the k mean vectors and the k covariance matrices
of the model and extract the required regression
coefficients $\beta_j$ and residual variances $\gamma_j^2$. Compute the
neural net weights and thresholds after the last iteration.
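
The bookkeeping of this step can be sketched for a single mixture component when the feature is scalar (d = 1), so that no matrix algebra is needed. This is an illustration under that assumption, not the authors' implementation; the function name and the returned dictionary are hypothetical, and the inputs are the (expected) sufficient statistics of one component together with the sample size T.

```python
import math


def m_step_component(N, SX, SXX, SY, SYX, SYY, T):
    """Turn the sufficient statistics of one component into parameters (d = 1)."""
    alpha = N / T                      # mixing probability
    mu = SX / N                        # mean of X
    nu = SY / N                        # mean of Y
    g11 = SXX / N - mu * mu            # variance of X
    g12 = SYX / N - mu * nu            # covariance of X and Y
    g22 = SYY / N - nu * nu            # variance of Y
    beta = g12 / g11                   # regression coefficient of Y on X
    gamma2 = g22 - beta * g11 * beta   # residual variance
    gamma = math.sqrt(gamma2)
    return {
        "alpha": alpha, "mu": mu, "nu": nu,
        "beta": beta, "gamma2": gamma2,
        "w1": beta / gamma,                 # net weight
        "w0": (nu - beta * mu) / gamma,     # net threshold
    }
```

For d > 1 the divisions by `g11` become solutions of linear systems with the d by d block of the covariance matrix; the structure of the computation is otherwise the same.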

NUMERICAL EXPERIMENTS

We trained various versions of the model on data
generated by the model itself. In these experiments we
verified that the model behaves as the theory predicts;
in particular we were in each case able to recover the
parameters of the generating model with reasonable
accuracy. In addition, we trained the model on 50 dimensional
speech data belonging to the phones 'M'
Abstract

The  paper  uses  several  examples  to  illustrate  a  distinctive 
difference  between  alternative  models  of  biological  systems: 
those  of  the  mathematical  vs.  those  of  the  algorithmic  format. 
Primary  among  these  comparisons  are  the  models  of  research¬ 
ers  dealing  with  neural  networks  versus  those  of  artificial 
intelligence  [AI]  researchers  who  predicate  their  work  on  the 
cognitive  sciences.  We  show  how  the  literature  of  biology 
itself  reveals  why  one  approach  to  the  modelling  of  biological 
systems  is  more  likely  to  succeed  than  the  other.  We  compare 
historically the acclaimed successes of non-mathematical biologists
[e.g., Darwin’s ORIGIN OF THE SPECIES and
Lorenz’s paper, “Fashionable Fallacy of Dispensing with
Description”].

We  include  in  the  paper  a  review  of  the  literature 
dealing  with  the  principles  for  conducting  the  design  and 
analysis of experiments with computerised stochastic models,
applicable  whether  their  dynamics  are  ‘controlled’  within  the 
computer  mathematically  or,  alternatively,  algorithmically. 
Exemplary  models  of  AI  systems  are  the  current  software 
packages  being  implemented  throughout  the  research  and 
university  communities:  viz.,  bibliographic  retrieval 
programmes which, e.g., include statistical analyses for the
purpose  of  suggesting  alternative  subject-search  strategies. 

1  Introduction 

For  the  past  four  decades  [since,  e.g.,  McCulloch  and  Pitts 
(1943)],  researchers  in  AI  have  become  very  slowly  aware  of 
the  distinctive  advantage  which  algorithmic  models  possess 
over  those  other  computerised  models  of  the  strictly  math¬ 
ematical  format.  Quite  recent  authors  [e.g.,  Amit  (1989)] 
persist,  particularly  in  the  literature  of  neural  networks,  with 
their  fascination  with  mathematical  modelling,  as  though  the 
success  of  the  mathematically-expressed  Newtonian  models 
(of  physics)  will  automatically  be  conferred  on  their  own  work. 

On the other hand, Mihram (1973) noted that philosophers
Sayre and Crosson (1963) had been struggling with the
non-mathematical (“non-formalized”) nature of computer programming
as it might affect the modelling of mind, a mental
struggle being conducted as well in the context of computerised
modelling of social systems in that same decade by the
mathematician Kemeny (1969).

Completely generalizing this struggle to biological
systems, including not only neural networks/organs but also
socio-political organizations, was the 1975 Ludwig von
Bertalanffy Lecturer, J.G. Miller (1978). Miller notes that
there are seven levels of living systems, from the cell to the
‘supra-national society’, and that at any level there are nineteen
functional subsystems, the central one of which is the system’s
decider. Since any algorithm is a recipe for a decision-making
process, Miller unknowingly [cf. Mihram (1979)] had uncovered
the preference for algorithmic, as opposed to mathematical,
models among biologists, sociologists, and sociobiologists
as well.

2  The  Algorithm 

Wheatley  and  Unwin  (1972)  made  quite  explicit  what  Mihram 
(1970)  had  suggested  quite  strongly:  viz.,  that  algorithmic 
modelling  is  distinctly  different  from  models  written  in  the 
language  of  mathematics: 

An  algorithm  is  a  mathematical  recipe.  From  this,  its 
meaning  has  been  extended  to  cover  a  recipe  in  any 
field  of  activity. 

Wheatley/Unwin  (1972) 

This distinction between the algorithm and mathematics is,
however, quite grammatical [cf. Mihram (1973)]: the algorithm
is a second-person expression, or command, whereas a
mathematical statement is expressed in the third person (e.g.,
F = m × a).

The  pertinence  of  the  distinction  to  biologists,  how¬ 
ever,  lies  in  Miller’s  revelation  (1978)  that  every  living  system, 
no  matter  how  small  or  complex,  contains  as  its  central 
subsystem  its  decider: 

the  executive  which  receives  information  from  all  the 
other  subsystems  and  transmits  to  them  information 
outputs  that  control  the  entire  organization. 

Miller  (1978) 

Thus, if one is to capture the dynamics of any living system in
terms of a computerised model, one would do well to employ
the algorithmic (as opposed to mathematical) construction.
The algorithm is ideally suited for capturing the dynamics of
any living system because it can precisely describe the conditions
under which a change is made, a decision or choice is
enacted.

3  Exemplary  Systems 

Many researchers in AI take, nonetheless, the mathematical
approach: e.g., researchers dealing with neural networks [cf.,
e.g., Newman’s paper in these 1991 proceedings and Gagliano
et al. (1991)] express their models in mathematics, then use
computer algorithms to exercise a particular solution to these
mathematical relationships.

This is the same approach used by the authors (e.g.,
Forrester and the Meadows-es) of the once-highly-touted
“world models”: viz., describe the world’s economic development
in terms of differential, or difference, equations, then
solve (arithmetically evaluate) this ‘system’ of time-dependent
equations on a computer [Mihram (1974a)]. Unfortunately,
here the underlying algorithms mime the passage of
time by: (a) computing, from the present status, the status at the
next step of time; (b) advancing time by one unit; and then, (c)
using the same algorithms, re-computing the next status.

Unfortunately,  such  an  approach  fails  to  capture  the 
quite  erratic  dynamics  of  any  living  system:  one  needs  to  write 
an  algorithm  which,  like  the  particular  living  system  which  it 
describes,  is  activated  not  regularly  but,  rather,  if  and  when 
required. 
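
The contrast can be made concrete with a small event-driven loop, in which the model is activated only when something happens rather than at every tick of a clock. This is merely an illustrative sketch: the function names, the event format and the toy decider `respond` are all hypothetical.

```python
import heapq


def run_event_driven(events, decide, horizon):
    """Process (time, payload) events in time order up to the horizon.

    decide(t, payload) returns a list of follow-up events to schedule,
    so the model acts if and when required, not at regular intervals.
    """
    queue = list(events)
    heapq.heapify(queue)
    log = []
    while queue:
        t, payload = heapq.heappop(queue)
        if t > horizon:
            break
        log.append((t, payload))
        for follow_up in decide(t, payload):
            heapq.heappush(queue, follow_up)
    return log


def respond(t, payload):
    """Toy decider: answer each stimulus half a time unit later."""
    return [(t + 0.5, "response")] if payload == "stimulus" else []
```

With stimuli at times 1.0 and 4.25, the loop does work at exactly four instants (two stimuli and two responses), whereas a fixed-step scheme would recompute the status at every step in between.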

The algorithmic format, as opposed to the mathematical,
among computerised models is thus far better suited to capture
with scientific credibility the dynamics of any living system
(or, of any system containing at least one living component).

The  researchers  dealing  with  neural  networks  via 
their  mathematical  models  typically  are  describing  motor 
activities  of  the  living  system;  however,  artificial  intelligence 
researchers,  attempting  to  capture  the  decision-making  capa¬ 
bilities  of  a  living  organism,  are  finding  that  the  algorithm  is 
much  better  suited  to  their  task  than  is  mathematics,  notwith¬ 
standing  the  negativistic  approach  of  writers  like  Winograd/ 
Flores  [cf.  Mihram,  1989]. 

As a further example, consider the currently increasing
use of bibliographic retrieval systems in major research
libraries. These software packages, or computer programmes,
are in actuality simulation models of a librarian-researcher
team seeking pertinent literature citations on a specified logical
combination of subjects. The models become, in effect, an
AI model of a librarian or researcher at his/her task. They are
not mathematical, but they do describe the reason why algorithmic
models are much better suited for capturing the dynamics
of any living system than is mathematics: the decisions are
described precisely by algorithms, not by mathematical
expressions.

4  Concluding  Remarks 

The  history  of  science  actually  reveals  that  one  need  not  use 
mathematics  in  order  to  qualify  as  a  scientist.  Newton  may 
well  have  given  mathematics  an  esteemed  place  among  lan¬ 
guages  used  by  scientists,  and  the  French  philosophers/math¬ 
ematicians/scientists  of  the  early  nineteenth  century  only  en¬ 
hanced  this  image  [cf.,  e.g.,  Mihram,  1991]  when  they  virtually 
‘institutionalized’  the  notion  of  scientific  method  as  being  no 
more  than  the  theorem-proving  mechanism  of  mathemati¬ 
cians. 

Ampere  and  these  other  early  nineteenth-century 
scientists  were  in  actuality  only  serving  to  confirm  the  correct¬ 
ness  of  Newton’s  laws:  they  first  accepted/assumed  that  New¬ 
ton  was  correct,  then  assumed  (like  the  geometry  student  in 
quest  of  the  terminating  ‘QED’)  that  matter  is  particulate  in  its 
character,  and  then  by  mathematical  argumentation  derived 
results  (such  as  the  inverse-square  laws  of  electricity  and 
magnetism). 

However,  scientists  (and  biologists,  particularly) 
should  recall  the  success  (also  in  the  nineteenth  century)  of 
Charles  Darwin.  His  ORIGIN  OF  THE  SPECIES,  if  it  were 
not  for  the  editorial  insertion  of  the  pagination  sequence, 
contains  virtually  no  mathematics.  As  importantly,  they  should 
heed  the  message  of  Nobel  Laureate  Konrad  Lorenz: 

The Fashionable Fallacy [Today] of Dispensing with
Description [in Favour of Mathematics]" ... "I have
never in my life published a book or a paper with
either a table or a graph in it.

Lorenz,  1973 

Scientists  who  convey  their  model  of  the  reality  which  they 
have  observed  may  choose  a  natural  language  (the  first-person 
format:  a  la  Darwin),  the  language  of  mathematics  (the  third- 
person  format:  a  la  Newton),  or  computer  programming  (the 
second-person  format).  The  decision/choice  must  not  be  a 
mere  predisposition,  but,  rather,  a  result  of  a  reflexion  [cf. 
Mihram  and  Mihram,  1984;  Mihram,  1974b]  on  the  intrinsic 
character  of  the  natural  phenomenon,  or  system  of  phenomena, 
being  studied/observed.  Are  deciders  to  be  mimed? 
Abstract

We  describe  a  method  for  using  pseudo  realizations  of 
data  to  check  the  validity  of  the  simple  bootstrap  anal¬ 
ysis  considered  in  Efron  (1979).  A  simulation  study  is 
performed  to  demonstrate  the  usefulness  of  the  proposed 
method. 

1 Introduction

Simple bootstrap analysis, as described in Efron (1979),
gives nonparametric estimates of the accuracy of a statistic of
interest. Major advantages of a bootstrap analysis are
its simplicity and its applicability when the analytic
method is intractable. However, it is not clear in general
whether a bootstrap analysis is valid. Refer to Bickel
and Freedman (1981) for examples in which the bootstrap
analysis fails. In view of this, an algorithm is proposed in
this article for using pseudo realizations of data to check
its validity. The specifics of this algorithm are given in
Section 2.

Let us start with a brief review of the one-sample
simple bootstrap analysis. Suppose the quantity of interest
is $\theta(F)$, which is a parameter of the unknown distribution
F. Let $s(\mathbf{x}_n)$ be an estimate of $\theta(F)$ based
on $\mathbf{x}_n$, where $\mathbf{x}_n = (x_1, \ldots, x_n)$ denotes a realization of
the random sample $\mathbf{X}_n = (X_1, \ldots, X_n)$ from F. We then
need to assess the accuracy of $s(\mathbf{X}_n)$ as an estimator of
$\theta(F)$. In this article, the measure of accuracy, $\phi$, will
always be the kth percentile of the distribution
of $\sqrt{n}\,[s(\mathbf{X}_n) - \theta(F)]$ when the distribution of
$\sqrt{n}\,[s(\mathbf{X}_n) - \theta(F)]$ is nondegenerate. However, the proposed
algorithm is also applicable to other measures of
error, such as the standard error.

In this case, the bootstrap estimate of $\phi$ is the corresponding
kth percentile of $\sqrt{n}\,[s(\mathbf{X}_n^*) - \theta(F_n)]$. Here
$F_n$ is the usual empirical distribution function based on
$\mathbf{x}_n$ and $s(\mathbf{X}_n^*)$ is the corresponding estimate based on
the bootstrap sample $\mathbf{X}_n^* = (X_1^*, \ldots, X_n^*)$, which is a
random sample of size n from $F_n$.
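
A Monte Carlo approximation of this bootstrap estimate might look as follows. The sketch assumes that the statistic is the plug-in estimate, so that $\theta(F_n)$ equals the statistic evaluated on the observed sample; the function names, the nearest-rank percentile rule and the default of 2000 replications are choices made for the illustration.

```python
import math
import random


def percentile(values, k):
    """k-th percentile (0 < k < 100) by the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(k / 100.0 * len(ordered)))
    return ordered[rank - 1]


def bootstrap_percentile(x, stat, k, reps=2000, rng=None):
    """Bootstrap estimate of the k-th percentile of sqrt(n) * (s(X) - theta(F))."""
    rng = rng or random.Random(0)
    n = len(x)
    centre = stat(x)                     # plug-in value theta(F_n)
    roots = []
    for _ in range(reps):
        # A bootstrap sample: n draws with replacement from the data.
        resample = [rng.choice(x) for _ in range(n)]
        roots.append(math.sqrt(n) * (stat(resample) - centre))
    return percentile(roots, k)
```

For the sample mean one would pass `stat=lambda v: sum(v) / len(v)`; passing an explicit `random.Random` instance makes the Monte Carlo result reproducible.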

The proposed algorithm is motivated by the following
argument. Suppose that we have two observed samples
of size n from F. Denote them by $\mathbf{x}_{n1}$ and $\mathbf{x}_{n2}$, respectively.
Although $\phi$ is unknown, it is a fixed number.
When a bootstrap analysis is a valid one, the two bootstrap
estimates of $\phi$ based on $\mathbf{x}_{n1}$ and $\mathbf{x}_{n2}$, respectively,
should not be too different. In other words, the variability
of the bootstrap estimate of $\phi$ over realizations of $\mathbf{X}_n$
should be “small” compared to $\phi$ when the bootstrap
analysis is valid.

In summary, a bootstrap analysis is not valid if the
bootstrap estimate of $\phi$ varies “dramatically” over realizations
of $\mathbf{X}_n$. Therefore, the accuracy or the sensitivity
of the bootstrap estimate of $\phi$ over realizations of $\mathbf{X}_n$ should
be analyzed before reporting the bootstrap statistics.

However, a major hurdle in observing the variability
of bootstrap statistics over $\mathbf{X}_n$ is that the statistician
has available only one realization of $\mathbf{X}_n$, namely $\mathbf{x}_n$. Hence we
propose to generate “pseudo” realizations of $\mathbf{X}_n$ based
on a smoothed estimate of F to get an estimate of the
variability of the bootstrap estimate of $\phi$. This algorithm
can be called smooth bootstrap-after-bootstrap, following
the terminology of Efron (1990b). This idea is, strictly speaking, not
new. It is just another application of the bootstrap. This
problem is also considered in Efron (1990b), which suggests
using the jackknife method to estimate the variability of
the bootstrap estimate of $\phi$. This leads to the so-called
jackknife-after-bootstrap method.

2  Proposed  Algorithm 

According  to  the  discussions  in  Section  1,  the  following 
algorithm  is  proposed  to  assess  the  “accuracy”  of  the 
simple  bootstrap  analysis.  This  algorithm  proceeds  in 
three  steps: 

Step 1. Construct a smoothed estimate of F, $F_{sn}$. (See
Section 3 for discussions on the construction of $F_{sn}$.)

Step 2. Draw B random samples of size n from $F_{sn}$, say,
for $1 \le b \le B$,

$X_{sbi} = x_{sbi}$,  $X_{sbi} \sim_{\mathrm{ind}} F_{sn}$,  $i = 1, \ldots, n$.

Call these the test bootstrap samples, $\mathbf{X}_{sb} = (X_{sb1}, \ldots, X_{sbn})$
and $\mathbf{x}_{sb} = (x_{sb1}, \ldots, x_{sbn})$.

Step 3. For each test bootstrap sample $\mathbf{x}_{sb}$, find its bootstrap
estimate of $\phi$ and then study the variability among
those B bootstrap estimates of $\phi$.
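
The three steps can be sketched as below. Note the substitution: the smoothed estimate of F is taken here to be a Gaussian kernel estimate with a user-supplied bandwidth `h`, which is an assumption made only for brevity and is not the logspline estimate discussed in Section 3. All function names are hypothetical.

```python
import math
import random


def boot_pct(x, stat, k, reps, rng):
    """Monte Carlo bootstrap estimate of the k-th percentile of the root."""
    n = len(x)
    centre = stat(x)
    roots = []
    for _ in range(reps):
        resample = [rng.choice(x) for _ in range(n)]
        roots.append(math.sqrt(n) * (stat(resample) - centre))
    roots.sort()
    return roots[max(1, math.ceil(k / 100.0 * reps)) - 1]


def smooth_draw(x, h, rng):
    """Step 2: one pseudo realization of size n from the smoothed estimate.

    Drawing a data point and adding Gaussian noise of scale h samples
    from a Gaussian kernel estimate (an assumption, see the lead-in).
    """
    return [rng.choice(x) + h * rng.gauss(0.0, 1.0) for _ in range(len(x))]


def bootstrap_after_bootstrap(x, stat, k, B, reps, h, seed=0):
    """Step 3: the B bootstrap estimates, one per test bootstrap sample."""
    rng = random.Random(seed)
    return [boot_pct(smooth_draw(x, h, rng), stat, k, reps, rng)
            for _ in range(B)]
```

The spread of the returned list is what Step 3 asks the analyst to examine. Setting `h = 0` draws the test samples from the empirical distribution itself, and the cost grows as B times `reps` resamples of size n.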

When a “significant” variation among the B bootstrap
estimates of $\phi$ is found, it indicates that the result
of the bootstrap analysis is dubious. Let $F_{nb}$ be the empirical
distribution function based on $\mathbf{x}_{sb}$. Since $F_{nb}$ lies in
a neighborhood of $F_{sn}$, the “significant” variation found
indicates a lack of uniformity over that
neighborhood. Hence, we may cast doubt on the usefulness
of the bootstrap analysis, since $F_{sn}$ lies in a small neighborhood
of F. No specific recipe for measuring variation
is given in this article. See Sections 3 and 4 for further
discussions in this regard.

If $F_{sn}$ is replaced by $F_n$ at Step 1 of the proposed algorithm,
the implementation of the proposed algorithm
is almost identical to the implementation of the nested double
bootstrap algorithm in Beran (1987) and others.
However, these two algorithms are proposed with totally
different rationales. The nested double bootstrap is proposed
to improve the bootstrap estimate of $\phi$ when the
bootstrap works, but the proposed algorithm is used to
estimate the variability of the bootstrap estimate of $\phi$.

3  Discussion 

An algorithm is proposed for evaluating the variability
of a bootstrap analysis over realizations of $\mathbf{X}_n$. A practical
disadvantage of this algorithm is that it is computationally
expensive. As a rough guide, the execution
time of the proposed algorithm is roughly equal to the
execution time of evaluating the bootstrap method by a
Monte Carlo experiment. Judging from various reports in the
literature, and given the greater availability of fast computers,
the computational cost should not be a big problem in
today’s computing environment.

In the implementation of the proposed algorithm, four
issues need to be addressed, namely:

1. the prescription of $F_{sn}$ (in Step 1),

2. the choice of B (in Step 2),

3. the computation of bootstrap statistics (in Step 3),
and

4. the variability of bootstrap statistics (in Step 3).

For the first issue, there are various methods for constructing
a smooth probability density in the density estimation
literature. The smooth probability distribution
$F_{sn}$ can then be obtained by an appropriate integration.
Two natural questions then arise: whether to use a smooth
distribution function or the empirical distribution function,
and how to choose the smoothing scheme. For the first
question, refer to Hall, DiCiccio, and Romano (1989)
and references therein. As a remark, the use of a smooth
probability distribution may not be appropriate if X
is a discrete random variable. For the second question,
we are investigating the smoothing scheme based on the
logspline density estimate in Stone and Koo (1986). The
result will be reported elsewhere. An advantage of the logspline
density estimate over other smoothing schemes is
that most widely used density functions are of the
log-spline form.

For the second issue, we suggest letting B be around
$n(n-1)/2$ for the following reason. When the
jackknife-after-bootstrap method is used, the accuracy
measure of bootstrap statistics is obtained by repeatedly
deleting a single observation. On the other hand,
$\mathbf{x}_{sb}$ may contain any number of copies of $x_i$, with different
probabilities, among those B pseudo realizations of $\mathbf{x}_n$ if $F_n$
is used in place of $F_{sn}$ in the proposed algorithm. Furthermore,
Theorem 6.1 of Efron (1982) attempts to view the
jackknife as a linear approximation to the bootstrap. As
is known, the jackknife method may have trouble with
markedly nonlinear statistics. To keep the proposed
algorithm from reducing to the jackknife-after-bootstrap
method, we would like to choose B large enough to guarantee
that these pseudo realizations include some
of the “delete-many” samples.
For the third issue, it is known that one cannot usually
compute the bootstrap statistics analytically except
in special cases or in small samples. A viable alternative
is to approximate the bootstrap distribution numerically
by means of Monte Carlo sampling. When this method
is used, the number of Monte Carlo samples will be
constrained by the available computing power. This
raises the question of how many bootstrap replications
must be taken to ensure that the observed variability at
Step 3 does not come from the randomness added by the
Monte Carlo sampling. Note that a bootstrap sample is
the same as a random sample of size n drawn with replacement
from the actual sample $\mathbf{x}_n$. For simple statistics,
an estimate of the variability of Monte
Carlo sampling can be obtained from a Berry-Esseen
type bound. It turns out that a large number of replications
may be needed. However, Efron (1990a) suggested
that between 1000 and 2000 replications are appropriate.
Currently, we are investigating this error bound
along the lines of Efron (1990b).

For the last issue, the variability can be revealed by
various available exploratory data analysis tools such as
the one used in Section 4. Formal tests similar to those
proposed in Nair (1982) can also be used.


4  Simulation  Study 

A Monte Carlo study was performed to demonstrate
the usefulness of the algorithm proposed in Section 2.
For simplicity, we consider the estimation of the population
mean by the sample mean based on 50 observations.
The two cases considered for F are Normal and Cauchy,
and the measure of accuracy is given by various percentiles. The
specific percentiles considered here are the 5th, 10th, 16th,
and 32nd. Since the density function of the Normal is of
the log-spline form but the density function of the Cauchy
is not, we replace $F_{sn}$ in Step 1 by $F_n$ in this Monte
Carlo study to avoid a possible bias toward the proposed
algorithm. Therefore, the algorithm used here
is the bootstrap-after-bootstrap method instead of the
smooth bootstrap-after-bootstrap method as described
in Section 2. Also, a Monte Carlo algorithm is used in
Step 3 to find the bootstrap estimate of $\phi$.

For each realization of 50 observations, B (in Step 2)
is set to 1000 and the number of bootstrap replications
(in Step 3) is 2000. Results are then summarized in Figures
2 and 4 for the first realization. In these figures, the
plotted curves from left to right are the estimated density
functions for those four percentiles arranged in ascending
order. Each curve is obtained by applying a kernel
smoother to the 1000 bootstrap percentiles. The kernel
smoother uses a triangular kernel, and the bandwidth
is set to one-quarter of the sample range of these 1000
bootstrap percentiles. However, Figure 4 is constructed
without the normalizing factor $\sqrt{50}$. This experiment is
then repeated another 99 times. The characteristics
of the figures from the next 99 realizations are similar
to those of Figures 2 and 4, respectively. Here the
characteristics refer to the amount of overlap among the
four density functions and the shape of the density
functions. However, the "center" of these curves does vary.
For example, the median of the estimated 5th percentile
density function over these 100 experiments ranges over
[-2.078, -1.119] when F is Normal.
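
The smoothing step described above can be sketched as follows. This is only an illustration: the function name `triangular_kde` is hypothetical, and it assumes that "bandwidth" means the half-width of the triangular kernel, which the text does not state.

```python
import numpy as np

def triangular_kde(values, grid, bandwidth=None):
    """Kernel density estimate using a triangular kernel.

    If no bandwidth is given, one quarter of the sample range of
    `values` is used (assumed here to be the kernel half-width).
    """
    values = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if bandwidth is None:
        bandwidth = (values.max() - values.min()) / 4.0
    # Scaled distance between every grid point and every observation.
    z = (grid[:, None] - values[None, :]) / bandwidth
    # Triangular kernel K(z) = max(1 - |z|, 0), which integrates to one.
    kernel = np.clip(1.0 - np.abs(z), 0.0, None)
    return kernel.mean(axis=1) / bandwidth
```

Applied to the 1000 bootstrap percentiles of one realization, the returned values trace one of the plotted curves.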

In order to check whether the estimate of the variability
of the bootstrap estimate of $\phi$ obtained from
bootstrap-after-bootstrap is close to the variability of the
bootstrap estimate of $\phi$, we compute bootstrap statistics
based on 2000 replications for 100 realizations of 50
observations. Figures 1 and 3 summarize the results from
those 100 realizations. They are constructed in the same
fashion as Figures 2 and 4. Figure 3 shows clearly
that the four estimated density functions are concentrated
around [-10, 0] and spread over a wide range of
values. This simply reflects the fact that a few
wild outliers are present in most realizations of 50
observations.

Based on Bickel and Freedman (1981) and Knight
(1989), the bootstrap analysis is useful for Normal but
not for Cauchy. Figures 2 and 4 confirm this. The
dissimilarity between Figure 3 and Figure 4 suggests that
the estimate of the variability of the bootstrap estimate
of $\phi$ obtained from bootstrap-after-bootstrap is not
necessarily equal to the variability of the bootstrap estimate
of $\phi$.

In summary, the proposed algorithm has the potential
to reveal whether a bootstrap analysis is valid.
But the proposed estimate of the measure of accuracy
is not necessarily close to the unknown measure
of accuracy, $\phi$. This again confirms that the bootstrap
analysis may sometimes fail, although it is quite a useful
method.
Abstract

Quasi-random  sequences  are  known  to  give  efficient  nu¬ 
merical  integration  rules  in  many  Bayesian  statistical  problems 
where  the  posterior  distribution  can  be  transformed  into  pe¬ 
riodic  functions  on  the  n-dimensional  hypercube.  From  this 
idea  we  develop  a  quasi-random  approach  to  the  generation 
of  resamples  used  for  Monte  Carlo  approximations  to  boot¬ 
strap  estimates  of  bias,  variance  and  distribution  functions.  We 
demonstrate  a  major  difference  between  quasi-random  boot¬ 
strap  resamples,  which  are  generated  by  deterministic  algo¬ 
rithms  and  have  no  true  randomness,  and  the  usual  pseudo¬ 
random  bootstrap  resamples  generated  by  the  classical  boot¬ 
strap  approach.  Various  quasi-random  approaches  are  consid¬ 
ered  and  are  shown  via  a  simulation  study  to  result  in  approx- 
imants  that  are  competitive  in  terms  of  efficiency  when  com¬ 
pared  with  other  bootstrap  Monte  Carlo  procedures  such  as  bal¬ 
anced  and  antithetic  resampling. 

1.  Introduction 

Let $\mathcal{X} = \{X_1, \ldots, X_n\}$ denote a random sample of size n,
write $\hat{T}$ for a function of these data, and let $\hat{T}^*$ represent the
same function of the data in a sample $\mathcal{X}^* = \{X_1^*, \ldots, X_n^*\}$
drawn randomly from $\mathcal{X}$, with replacement. Thus, $\mathcal{X}^*$ is a
uniform resample. The bootstrap estimate of $t = E(\hat{T})$ is
$\hat{t} = E(\hat{T}^* \mid \mathcal{X})$. In the event that the $X_i$'s are vectors, assume
that we can write $\hat{T} = g(\bar{X})$ for a smooth function g, where
$\bar{X}$ denotes the mean of $\mathcal{X}$. Let bracketed superscripts denote
indices of vector elements, and put $g_j(x) = \partial g(x)/\partial x^{(j)}$,
$G(X_i) = \sum_j X_i^{(j)} g_j(\bar{X})$. We begin by describing an algorithm
for constructing a quasi-Monte Carlo approximation to $\hat{t}$.

First, sort the n data values in $\mathcal{X}$, obtaining
$\mathcal{X} = \{X_{(1)}, \ldots, X_{(n)}\}$ where $G(X_{(1)}) \le \ldots \le G(X_{(n)})$.
(Alternatively, we could ask that $G(X_{(1)}) \ge \ldots \ge G(X_{(n)})$. If the
sample $\mathcal{X}$ is univariate then we may order the sample values
directly, and not pass to the function $G(X_{(i)})$.) Let B denote the
number of bootstrap resamples and let $u_b = (u_b^{(1)}, \ldots, u_b^{(n)})$,
$1 \le b \le B$, represent points in the n-dimensional hypercube
$C_n = (0, 1)^n$ generated by a quasi-random algorithm, which
we shall describe in Section 2. Transform the $u_b$'s into a set of
index vectors $i_b = (i_b^{(1)}, \ldots, i_b^{(n)})$, $1 \le b \le B$, by

    $i_b^{(j)} = [1 + n u_b^{(j)}]$,   $b = 1, \ldots, B$;   $j = 1, \ldots, n$,

where $[x]$ denotes the largest integer not exceeding x. Then
each $i_b^{(j)}$ is an integer between 1 and n. Conditional on $\mathcal{X}$, let
$\mathcal{X}_1^Q, \ldots, \mathcal{X}_B^Q$ denote quasi-random resamples defined by

    $\mathcal{X}_b^Q = \{X_{(i_b^{(1)})}, \ldots, X_{(i_b^{(n)})}\}$,   $b = 1, \ldots, B$.
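
A minimal sketch of this index transformation is given below. The function name `quasi_random_resamples` and the `key` argument (holding the values $G(X_i)$ for vector data) are hypothetical, and the points `u` may come from any of the generators of Section 2.

```python
import numpy as np

def quasi_random_resamples(data, u, key=None):
    """Map points u (shape B x n) in the unit hypercube to resamples.

    data : the n sample values (rows, for vector data)
    u    : B x n array of points in (0, 1)
    key  : optional length-n values used for sorting, e.g. G(X_i);
           univariate data are sorted directly when key is None
    """
    data = np.asarray(data)
    u = np.asarray(u, dtype=float)
    n = data.shape[0]
    keys = data if key is None else np.asarray(key)
    if keys.ndim != 1:
        raise ValueError("a one-dimensional sort key is required")
    sorted_data = data[np.argsort(keys, kind="stable")]
    # i = [1 + n*u] lies in 1..n for u in (0, 1); shift to 0-based indices.
    idx = np.floor(1 + n * u).astype(int) - 1
    idx = np.clip(idx, 0, n - 1)  # guard against u equal to exactly 1.0
    return sorted_data[idx]
```

Each row of the result is one quasi-random resample, so a statistic can be evaluated row by row.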

Thus, by using the idea of selecting points according to
a deterministic scheme that is well suited for numerical
integration, we develop a quasi-random approach to bootstrap
resampling. Our contributions are twofold: (i) we expand
the scope of usefulness of quasi-random methods to other
computer-intensive areas, and in particular we familiarize
"bootstrappers" with this school of thought; (ii) we explore
possible efficiency gains (over pseudo-random resampling) from
using different types of quasi-random resampling.

2.  Quasi-random  sequences 

The  terminology  given  here  is  not  always  standard  but  has 
been  found  to  be  the  easiest  for  distinguishing  the  nature  of 
quasi-random  sequences.  We  shall  consider  regular  quasi¬ 
random  sequences  generated  by 

    $u_{b+1} = u_b + \alpha \pmod 1$,    (1)

where $u_1$ is a fixed or random point in the n-dimensional
hypercube. Note that the jth coordinate of $u_{b+1}$ is the
fractional part of the jth coordinate of $u_b + \alpha$. Regular
sequences are distinguished as rational or irrational according
as $\alpha$ is a vector consisting of only rationals or only
irrationals. We also consider irregular or quasi-random sequences
generated by other forms of algorithm and include pseudo-random
sequences. Assessment of how "good" a deterministic sequence is
can often be expressed in terms of its discrepancy. The discrepancy
measure provides a bound on the integration (i.e. expectation)
error in numerical integration, provided the function to be
integrated is of bounded variation. In the bootstrap framework,
approximations of bias, variance, and distribution functions are
generally based on well-behaved functions. Therefore, it is
anticipated that low-discrepancy sequences for integration rules
will provide bootstrap approximants with high accuracy.
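
Recursion (1) can be written in closed form as $u_b = u_1 + (b-1)\alpha \pmod 1$, which gives the following short sketch. The function name `regular_sequence` is hypothetical, and for irrational $\alpha$ the result is only as accurate as floating-point arithmetic allows.

```python
import numpy as np

def regular_sequence(u1, alpha, B):
    """Return the B x n array of points u_1, ..., u_B defined by (1)."""
    u1 = np.asarray(u1, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    steps = np.arange(B)[:, None]        # b - 1 for b = 1, ..., B
    return (u1 + steps * alpha) % 1.0    # fractional part, coordinate-wise
```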

To construct low-discrepancy sequences, Hlawka (1962)
used the so-called method of good lattice points, which takes
account of the regularity of the function $f$, in addition to the
bounded variation property of $f$. Hlawka considered the case
of B being prime and k a point with integral coordinates. Let
$R(k)$ be the set of all non-zero vectors h such that
$k \cdot h \equiv 0 \pmod B$. Define

    $r(h) = \prod_{j=1}^{n} \max(1, |h^{(j)}|)$;   $\rho(k) = \min_{h \in R(k)} r(h)$.

Hlawka called the lattice point k good modulo B if

    $\rho(k) \ge B (8 \log B)^{-(n-1)}$


and proved the existence of good lattice points modulo any
prime B. Zaremba (1966) showed that in the case n = 2, even
"better" lattice points corresponding to a larger $\rho(k)$ can be
obtained where B does not need to be prime. Niederreiter (1977)
improved on Zaremba's result for an arbitrary dimension n.
Shaw (1988) considered another measure of distance defined
as

    $\nu(k) = \min_{h \in R(k)} \sum_{j=1}^{n} |h^{(j)}|$,

where $\nu$ is an upper bound to the minimum number of parallel
(n - 1)-dimensional hyperplanes covering the sequence
$u_1, \ldots, u_B$. We shall discuss below the construction of several
different types of quasi-random sequences and their properties.
•  Rational  sequences 

SEQUENCE 1. Here the $u_b$'s are as defined in (1), where

    $\alpha = (B^{-1}, B^{-1}k, B^{-1}(k^2 \bmod B), \ldots, B^{-1}(k^{n-1} \bmod B))$.



The  construction  of  this  sequence  is  based  on  the  method  of 
good  lattice  points.  When  n  =  2,  it  is  possible  to  explicitly 
construct  good  lattice  points  by  using  continued  fractions. 
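
As a small illustration, the vector $\alpha$ of Sequence 1 can be computed exactly with rational arithmetic. The function name `korobov_alpha` is hypothetical, and the sketch assumes $1 \le k < B$, so that reducing $k$ itself modulo B changes nothing.

```python
from fractions import Fraction

def korobov_alpha(B, k, n):
    """Return alpha = B^{-1} (1, k, k^2 mod B, ..., k^{n-1} mod B)."""
    # pow(k, j, B) is the modular power k^j mod B.
    return [Fraction(pow(k, j, B), B) for j in range(n)]
```

For example, `korobov_alpha(237, 10, 10)` gives the vector for the first (n, B, k) combination used in the tables.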

• Irrational sequences. These can be generated using a
method closely related to that of rational sequences. Here the
$u_b$'s are as defined in (1), where $\alpha$ is an irrational point of the
form $\alpha = (\alpha_1, \ldots, \alpha_n)$ and $1, \alpha_1, \ldots, \alpha_n$ are linearly
independent over the rationals. Davis (1963, pp.356-457) proved the
equidistribution property for these sequences. Let $p_1, p_2, \ldots$ be
the sequence of prime numbers 2, 3, 5, 7, 11, ... and p be some
prime. A number x is said to have order q mod p if $x^q \equiv 1 \pmod p$
and $x^k \not\equiv 1 \pmod p$ for $1 \le k < q$.

In our bootstrap simulation study, we consider equidistributed
irrational sequences by using $\alpha$ as described below.
SEQUENCE  2 


SEQUENCE 3

where $\xi = p^{1/(n+1)}$,

SEQUENCE  4 

    $\alpha = \left(2\cos\frac{2\pi}{p}, 2\cos\frac{4\pi}{p}, \ldots, 2\cos\frac{2\pi n}{p}\right)$,

where $p \ge 2n + 3$ and satisfies either (i) 2 has order $p - 1$ mod p
or (ii) 2 has order $(p - 1)/2$ mod p and $p \equiv 7 \pmod 8$.

•  Irregular  sequences.  We  focus  attention  specifically  on 
three  irregular  sequences  that  have  been  used  successfully  in 
integration  problems. 

SEQUENCE  5.  (Haber  sequence) 

/b(b+l)  ^  b(b+l)  _N  ^ 

Oi,=  [ — 2 — v/^'  — 2 — 

SEQUENCE  6.  (Hammersley  sequence) 

Ui  =  (B~'b,  <i>pfb) —  ,  <^>p,_,(b)), 

where  pi,...,pn  are  the  first  n  -  1  primes  and  <fip(b)  is  the 
radical  inverse  function  of  b  to  the  base  p  (a  rigorous  definition 
is  given  below). 

SEQUENCE  7.  (Halton  sequence) 

Ub  =  {4>ptib),  <i>pi(b), . . . ,  <i>p„(b)). 

The  function  <i>p(b)  is  the  rational  inverse  function  of  b  to  the 
base  p,  obtained  by  taking  the  p-ary  representation  of  the  num¬ 
ber  6  and  reflecting  the  digits  about  the  decimal  point. 
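
The radical inverse function, and with it the Hammersley and Halton points, can be sketched as below. The function names are hypothetical and the list of primes is supplied by the caller.

```python
def radical_inverse(b, p):
    """Reflect the base-p digits of the integer b about the radix point."""
    inverse, scale = 0.0, 1.0 / p
    while b > 0:
        b, digit = divmod(b, p)   # peel off the least significant digit
        inverse += digit * scale
        scale /= p
    return inverse

def halton_point(b, primes):
    """Sequence 7: one radical inverse per coordinate."""
    return [radical_inverse(b, p) for p in primes]

def hammersley_point(b, B, primes):
    """Sequence 6: first coordinate b/B, then radical inverses."""
    return [b / B] + [radical_inverse(b, p) for p in primes]
```

For instance, 6 is 110 in base 2, so its radical inverse is 0.011 in base 2, that is 0.375.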

3.  Simulation  Study 

In this section we summarize the results of a simulation
study of the performance of quasi-random resampling relative
to uniform resampling. We applied our method to the problems
of estimating bias and variance when $T(\mathcal{X}) = \bar{X}^2$ or
$T(\mathcal{X}) = \sqrt{|\bar{X}|}$, and of estimating the distribution of the
Studentized mean. Let T be the numerical value of the statistic
of interest calculated from the original sample, and let
$T_b^*$ be the corresponding value calculated from the bth bootstrap
resample. Bias, variance and the distribution function
$F(x) = P(T^* \le x \mid \mathcal{X})$ can be estimated by

    $\widehat{\mathrm{bias}} = \bar{T}^* - T$,

    $\widehat{\mathrm{var}} = B^{-1} \sum_{b=1}^{B} (T_b^* - \bar{T}^*)^2$,

    $\hat{F}(x) = B^{-1} \sum_{b=1}^{B} I(T_b^* \le x)$,

where $\bar{T}^* = B^{-1} \sum_b T_b^*$.
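
These three Monte Carlo approximations translate directly into code. The sketch below assumes the B resampled statistics are already available as an array; the function name `bootstrap_estimates` is hypothetical.

```python
import numpy as np

def bootstrap_estimates(t_star, t_obs, x):
    """Return (bias, variance, F(x)) approximated from B resampled values.

    t_star : the B values T*_b computed from the resamples
    t_obs  : the statistic T computed from the original sample
    x      : point at which the distribution function is estimated
    """
    t_star = np.asarray(t_star, dtype=float)
    t_bar = t_star.mean()
    bias = t_bar - t_obs
    variance = np.mean((t_star - t_bar) ** 2)   # divisor B, as in the text
    f_hat = np.mean(t_star <= x)                # proportion with T*_b <= x
    return bias, variance, f_hat
```

The same function applies whether the resamples are uniform or quasi-random; only the way the resamples are generated differs.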

Consider the problem of estimating $F(x)$. We calculated $F(x)$
using 100,000 uniform resamples. Let $\hat{F}_U(x)$ and $\hat{F}_Q(x)$
denote our approximations to $F(x)$ using B uniform resamples
and B quasi-random resamples respectively. We computed
$D_Q = \{\hat{F}_Q(x) - F(x)\}^2$ and computed the average,
$D_U$, of $\{\hat{F}_U(x) - F(x)\}^2$ over M = 100 independent repeats
of the uniform resampling scheme, for a given sample. Note
that we do not need to average $D_Q$ over M repeats since there
is only one deterministic quasi-random sequence for each sample.
We then averaged $D_Q$ and $D_U$ over N = 250 independent
samples, obtaining $d_Q$ and $d_U$, say; and finally, took the ratio
$r = d_U/d_Q$. This gave a measure of the efficiency of quasi-random
resampling relative to uniform resampling in estimation
of distribution functions. The case of bias and variance
estimation can be treated similarly, with obvious analogues for
$d_U$ and $d_Q$. It was observed that quasi-random resampling does
not perform better or worse than uniform resampling in
the problem of bias estimation. Therefore the tables presented
in this paper will concentrate only on efficiencies in variance
and distribution estimation.
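
The efficiency ratio can be sketched as follows, assuming the squared errors have already been collected into arrays. The function name `relative_efficiency` and the array layout are hypothetical.

```python
import numpy as np

def relative_efficiency(D_Q, D_U):
    """Return r = d_U / d_Q.

    D_Q : length-N array, one squared error per sample (quasi-random)
    D_U : N x M array, squared errors over M uniform repeats per sample
    """
    D_Q = np.asarray(D_Q, dtype=float)
    D_U = np.asarray(D_U, dtype=float)
    d_Q = D_Q.mean()                 # average over the N samples
    d_U = D_U.mean(axis=1).mean()    # average over repeats, then samples
    return d_U / d_Q
```

A value of r above one indicates that quasi-random resampling was the more accurate scheme.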

In  the  problem  of  distribution  estimation,  we  have  re¬ 
stricted  our  considerations  to  rational  sequences  only.  Rational 
sequences  perform  better  than  straight  random  sequences  at  all 
quantile values. They exhibit the common pattern of better
performance towards the centre of the distribution. However,
efficiency gains at the tails are still impressive and surpass those
obtained from balanced and antithetic resampling, especially
when the parent population generating $\mathcal{X}$ is exponential.

4.  Conclusions 

Regular  sequences  are  no  more  difficult  to  implement 
than  pseudo-random  sequences  and  usually  exhibit  consistent 
trends  in  efficiency  gains.  Bootstrap  resampling  based  on 
Haber sequences is rather disappointing due to their erratic
behaviour, but quasi-random resampling based on radical inverse
functions such as the Hammersley and Halton sequences can
yield significant efficiency gains for large B. The behaviours
observed here for irregular sequences are in close agreement
with results in Shaw (1988) and Warnock (1972), who
concentrated on efficient numerical integration rules. It should
be emphasised that the problems of variance and distribution
estimation are usually of more practical importance than bias
estimation, since bias is generally small relative to standard
deviation. Therefore, even though quasi-random resampling
does not provide an improvement over pseudo-random resampling
in problems of bias estimation, quasi-random sequences
remain attractive in the bootstrap context because of their
superior performance in variance and distribution estimation. We
suggest  that  regular  sequences  be  applied  quite  generally  in 
bootstrap  resampling  problems,  although  greater  caution  is  rec¬ 
ommended  for  irregular  sequences.  A  more  rigorous  and  de¬ 
tailed  version  of  this  paper  is  available  from  the  author. 
Efficiencies for variance estimation using rational sequences

 n     B    k   Distribution     $T(\mathcal{X}) = \bar{X}^2$    $T(\mathcal{X}) = \sqrt{|\bar{X}|}$
10   237   10   Normal N(1,1)    2.25             1.81
                Exponential      2.35             2.17
                Folded Normal    2.35             2. .^6
10   342   17   Normal N(1,1)    2.15             1.75
                Exponential      2.23             2.06
                Folded Normal    2.23             2.76
10   610   23   Normal N(1,1)    2.15             1.87
                Exponential      2.24             2.19
                Folded Normal    2.36             2.29
Efficiencies for variance estimation using
irrational sequences

 n     B   Distribution     Seq. 2   Seq. 3   Seq. 4
10   100   Normal N(1,1)    1.34     1.33     1.03
           Exponential      1.55     1.37     1.05
           Folded Normal    1.62     1.39     1.08
10   200   Normal N(1,1)    1.39     1.38     1.00
           Exponential      1.75     1.41     1.02
           Folded Normal    1.79     1.44     1.09
10   300   Normal N(1,1)    1.89     1.61     1.11
           Exponential      1.93     1.69     1.24
           Folded Normal    1.97     1.75     1.26
10   500   Normal N(1,1)    1.96     1.88     1.39
           Exponential      2.30     2.00     1.38
           Folded Normal    2.41     1.97     1.42

Efficiencies for variance estimation using
Haber sequences

 n     B   Distribution     $T(\mathcal{X}) = \bar{X}^2$    $T(\mathcal{X}) = \sqrt{|\bar{X}|}$
10   100   Normal N(1,1)    10.31            2.20
           Exponential       4.43            5.80
           Folded Normal     7.67            5.19
10   200   Normal N(1,1)     4.46            1.37
           Exponential       1.91            5.74
           Folded Normal     2.84            4.25
10   300   Normal N(1,1)     3.22            1.82
           Exponential       2.87           14.86
           Folded Normal     3.9             6.58

Efficiencies for variance estimation using
Hammersley (S6) and Halton (S7) sequences

                             $T(\mathcal{X}) = \bar{X}^2$     $T(\mathcal{X}) = \sqrt{|\bar{X}|}$
 n      B   Distribution     S6      S7          S6      S7
10    500   Normal N(1,1)    2.37    0.46        1.87    1.89
            Exponential      0.89    0.21        8.86    6.75
            Folded Normal    1.54    0.35        5.61    6.03
10   1000   Normal N(1,1)    3.16    0.53        2.13    2.31
            Exponential      1.11    0.33        2.25    3.79
            Folded Normal    2.41    0.51        2.38    4.59
10   2000   Normal N(1,1)    5.99    0.65        3.78    4.01
            Exponential      2.19    0.39        3.15    4.50
            Folded Normal    2.60    0.47        3.00    4.71

Efficiencies for distribution estimations using
rational sequences

                                 $\alpha$ (x)
 n     B    k   Distribution     0.90 (1.282)   0.95 (1.645)   0.975 (1.96)
10   237   10   Normal N(1,1)    2.55           2.23           1.88
                Exponential      2.57           2.34           2.23
                Folded Normal    2.59           2.23           1.53
10   342   17   Normal N(1,1)    2.51           2.10           1.97
                Exponential      2.47           2.35           2.15
                Folded Normal    2.49           2.06           2.01
10   237   10   Normal N(1,1)    2.42           2.06           1.74
                Exponential      2.41           2.25           2.15
                Folded Normal    2.40           2.07           1.95

Efficiencies for bias, variance and distribution estimations using rational
sequences in comparison to balanced and antithetic resampling

                                   $T(\mathcal{X}) = \sqrt{|\bar{X}|}$    $T(\mathcal{X}) = \sqrt{n}(\bar{X}^* - \bar{X})/s^*$
                                                     $\alpha$:     0.90    0.95    0.975
Resampling method                  Bias    Var       $z_\alpha$:   1.282   1.645   1.96
Quasi-random using
rational sequence
(n,B,k) = (10,237,10)              1.00    2.17                    2.57    2.34    2.23
Balanced
(n,B) = (10,500)                   1.35    0.70                    1.36    1.11    1.04
Antithetic
(n,B) = (10,500)                   2.31    1.12                    1.23    1.06    1.00
1  Abstract

In  polarized  beam  studies,  an  asymmetry  statistic  of 
physical  interest  is  an  estimate  of  the  ratio  of  the  differ¬ 
ence  and  the  sum  of  the  Poisson  rate  parameters  for  two 
scattering  processes.  Typically,  an  additive  background 
signal  contributes  to  measurements  of  each  scattering  pro¬ 
cess.  Background  is  measured  in  a  third  experiment. 
Data  is  corrected  by  subtracting  measured  background. 
When  the  measured  background  is  larger  than  one  of  the 
other  measurements,  the  asymmetry  computed  from  the 
background  corrected  data  is  nonsensical.  For  such  cases, 
true  asymmetry  and  an  associated  conservative  interval 
are  estimated  using  a  bootstrap  procedure.  Bootstrap 
replications  of  the  observed  data  satisfy  a  constraint  that 
insures  physically  meaningful  results. 


2  Introduction 

In many areas of research, asymmetry statistics are
of physical interest. For example, in atomic collision
physics, asymmetry statistics computed from the scattering
of spin-polarized electrons from atoms carry information
about atomic structure (McClelland et al., 1989).
In materials science studies, maps of magnetic microstructure
are made based on the polarization of secondary electrons
emitted from the material after it is bombarded by
an energetic beam of electrons (Scheinfein et al., 1990).
To estimate these polarizations, asymmetry statistics are
computed.

Many  of  the  experiments  in  which  asymmetries  are  of 
interest  involve  the  counting  of  electrons  or  other  parti¬ 
cles.  Generally,  streams  of  pulses  (assumed  to  be  Poisson 
distributed)  are  counted  in  two  experiments,  one  for  each 
orientation  of  the  spins  in  the  system.  The  number  of 
counts  measured  in  each  experiment  is  associated  with 
an  intensity  for  each  of  the  two  spin  orientations.  The 
asymmetry  is  estimated  by  taking  the  ratio  of  the  differ¬ 
ence  and  background  corrected  sum  of  the  two  intensities. 


Suppose that the numbers of scattering events for two
different spin orientations are measured in two independent
experiments. Further, assume that each experiment
lasts the same amount of time t. This assumption can
be relaxed without loss of generality. The first observation
$N_1$ can be expressed as the sum of two unobservable
quantities as follows.

    $N_1 = N_1' + N_{BG,1}$    (1)

Above, $N_1'$ represents what would have been observed
if there had been no background. The number of counts
due to the background is $N_{BG,1}$. The terms on the right
hand side of Eq. 1 are realizations of Poisson processes
with parameters $\lambda_1 t$ and $\lambda_{BG} t$. The second measurement
is expressed as

    $N_2 = N_2' + N_{BG,2}$    (2)

where the two terms on the right side of Eq. 2 are independent
realizations of Poisson processes with parameters
$\lambda_2 t$ and $\lambda_{BG} t$. The goal is to estimate the asymmetry
term

    $R = \frac{\lambda_1 - \lambda_2}{\lambda_1 + \lambda_2}$.    (3)

Note that since true asymmetry R lies between -1 and
+1, so should any estimate of asymmetry as well as the
endpoints of any confidence interval for asymmetry.

In order to estimate the asymmetry, experimenters
measure background in a third independent experiment.
Suppose that this experiment also lasts time t. Further,
assume that the experimental conditions for the background
measurement are the same as for the other experiments.
The number of detected background counts
$N_{BG,3}$ is modeled as a realization of a Poisson process
with parameter $\lambda_{BG} t$. With this third measurement,
experimenters typically estimate asymmetry as

    $\hat{R} = \frac{N_1 - N_2}{N_1 + N_2 - 2 N_{BG,3}}$.    (4)
As the duration of the experiment increases, $\hat{R}$ converges
to R. However, for short experiments, $\hat{R}$ may be
far from R. Moreover, for short enough experiments, the
measured background can be greater than one of the other
two signals and the asymmetry computed from the data is
not between -1 and 1. That is, the above estimate is outside
the physically meaningful range. For such cases, it
would seem as though the experiment was a failure. Here,
useful information is extracted from such data using the
bootstrap.
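
A short numerical illustration of the estimate in Eq. 4 follows. The function name `naive_asymmetry` is hypothetical; the inputs are the Poisson means (240, 60, 50) and one simulated data set (220, 59, 65) discussed in Section 4.1.

```python
def naive_asymmetry(n1, n2, n_bg):
    """Background-corrected asymmetry: difference over corrected sum."""
    denom = n1 + n2 - 2 * n_bg
    if denom == 0:
        raise ZeroDivisionError("corrected sum of the signals is zero")
    return (n1 - n2) / denom

print(naive_asymmetry(240, 60, 50))   # 0.9, the true asymmetry
print(naive_asymmetry(220, 59, 65))   # about 1.08, outside [-1, 1]
```

In the second call the measured background (65) exceeds the second signal (59), which is exactly the situation in which the computed asymmetry leaves the physically meaningful range.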

3  Bootstrap  Approach 

Using a parametric bootstrap (Efron, 1982) approach,
replications of the observed data $(N_1, N_2, N_{BG,3})$ are
obtained by simulating Poisson random variables with
means $N_1$, $N_2$ and $N_{BG,3}$. To insure physically meaningful
results, replications for which simulated background is
larger than either of the other signals are discarded. Also,
replications for which twice background equals the sum
of the other signals are discarded. This second condition
insures that computed asymmetry is well defined. Thus,
the kth bootstrap replication of the observed data satisfies
the following constraint.

    $N_{BG,3}^{k} \le N_1^{k}$    (5)

    $N_{BG,3}^{k} \le N_2^{k}$    (6)

    $2 N_{BG,3}^{k} < N_1^{k} + N_2^{k}$    (7)

Because  of  this  constraint,  the  three  simulated  signals 
are  correlated  with  one  another.  The  true  asymmetry  is 
estimated  by  the  mean  of  the  bootstrapped  asymmetry 
statistics.  A  confidence  interval  is  also  computed  from 
the  histogram  of  the  bootstrapped  asymmetry  statistics. 
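
The procedure can be sketched as a rejection sampler. This is an illustration under stated assumptions rather than the authors' program: the function name, the batch-wise simulation, the seed handling and the `max_batches` guard are all hypothetical choices.

```python
import numpy as np

def bootstrap_asymmetry(n1, n2, n_bg, n_rep=10_000, seed=0, max_batches=1000):
    """Constrained parametric bootstrap of the asymmetry statistic.

    Poisson replicates are drawn with means equal to the observed counts;
    only replicates satisfying the constraint are kept.
    """
    rng = np.random.default_rng(seed)
    kept = np.empty(0)
    for _ in range(max_batches):
        s1 = rng.poisson(n1, size=n_rep)
        s2 = rng.poisson(n2, size=n_rep)
        sb = rng.poisson(n_bg, size=n_rep)
        # Discard replicates whose background exceeds either signal, or
        # whose doubled background equals the sum of the two signals.
        ok = (sb <= s1) & (sb <= s2) & (2 * sb < s1 + s2)
        r = (s1[ok] - s2[ok]) / (s1[ok] + s2[ok] - 2 * sb[ok])
        kept = np.concatenate([kept, r])
        if kept.size >= n_rep:
            return kept[:n_rep]
    raise RuntimeError("too few replicates satisfied the constraint")
```

The bootstrap estimate of asymmetry is then the mean of the returned array, and every retained statistic lies in [-1, 1] by construction.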

4  Applications 

4.1  High  R 

First, the Poisson parameters for the data were set to
(240, 60, 50). For this case, true asymmetry is 0.9. One
thousand data sets, where simulated background is larger
than one of the other two signals, were simulated. For
each data set, 10,000 bootstrap replications were simulated
as described earlier. In Figure 1, the histogram of
bootstrapped asymmetry statistics for one of the simulated
data sets, $(N_1, N_2, N_{BG,3}) = (220, 59, 65)$, is shown. For
this particular data set, the mean of the 10,000 bootstrapped
asymmetry statistics was 0.924. Hence, the
bootstrap estimate of asymmetry R is 0.924. This is very
close to the true value of 0.9. Intuitively, the method
worked well because $N_1$ was much larger than $N_2$. The
fact that the two measurements are far apart is telling us
that asymmetry is high even though background is larger
than $N_2$.

In Table 1, bootstrap estimates of asymmetry are listed
for ten simulated data sets. The average of the bootstrap
estimates for all 1000 data sets was 0.938. The standard
error of this average value is only 0.0004. Thus, the bootstrap
estimate of asymmetry is slightly biased. Although
slightly biased, root mean square prediction error (RMS)
was only 0.039.

A confidence interval for true asymmetry is computed
from the histogram of bootstrapped statistics as follows.
If the asymmetry $\hat{R}$ estimated from the observed data is
larger than unity, i.e. $N_1 > N_2$, a one-sided confidence
interval is computed. The upper endpoint of the interval
is unity. The lower endpoint is the 5th percentile of the
bootstrap histogram. If instead $N_2 > N_1$, the lower endpoint
is set to -1 and the upper endpoint is the 95th percentile
of the bootstrap histogram. If $N_1 = N_2$, the confidence
interval endpoints are the 2.5th and 97.5th percentiles.
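
The interval rule above can be sketched as follows. The function name `bootstrap_interval` is hypothetical, and the percentiles use NumPy's default linear interpolation, which the text does not specify.

```python
import numpy as np

def bootstrap_interval(stats, n1, n2):
    """Interval for true asymmetry from bootstrapped asymmetry statistics."""
    stats = np.asarray(stats, dtype=float)
    if n1 > n2:
        # One-sided interval with the upper endpoint fixed at unity.
        return float(np.percentile(stats, 5)), 1.0
    if n2 > n1:
        # One-sided interval with the lower endpoint fixed at -1.
        return -1.0, float(np.percentile(stats, 95))
    lower, upper = np.percentile(stats, [2.5, 97.5])
    return float(lower), float(upper)
```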

In  Table  1,  confidence  intervals  are  listed  for  the  ten 
data  sets.  For  the  1000  simulated  data  sets,  true  asym¬ 
metry  0.9  was  outside  the  computed  bootstrap  confidence 
interval  9  times  out  of  1000.  That  is,  coverage  was  99.1%. 
All  the  upper  endpoints  were  unity.  Hence,  for  this  case, 
the  bootstrap  method  gave  conservative  95%  confidence 
intervals. 


Table 1. High Asymmetry.

$N_1$   $N_2$   $N_{BG,3}$   $\hat{R}$   c.i.
220     59      65           0.924       (0.803, 1.0)
243     63      64           0.916       (0.796, 1.0)
227     50      53           0.928       (0.817, 1.0)
224     50      56           0.935       (0.829, 1.0)
236     48      60           0.953       (0.867, 1.0)
230     40      62           0.968       (0.899, 1.0)
238     44      51           0.947       (0.856, 1.0)
219     66      72           0.915       (0.779, 1.0)
236     50      61           0.948       (0.853, 1.0)
261     50      51           0.936       (0.840, 1.0)

4.2  Intermediate  R 

The  same  kind  of  analysis  done  above  was  repeated  for  the 
case  where  the  true  Poisson  parameters  were  assumed  to 
be  (460,420,400).  Here,  true  asymmetry  is  0.5.  In  Table 
2,  the  bootstrap  estimate  of  asymmetry  and  a  confidence 
interval  are  listed  for  ten  data  sets.  For  the  data  set 
(433,393,394),  the  histogram  of  bootstrapped  asymmetry 
statistics  is  shown  in  Figure  2.  Note  that  this  histogram  is 
more  dispersed  than  the  one  for  the  data  set  (220,59,65). 
Table 2. Intermediate Asymmetry.

$N_1$   $N_2$   $N_{BG,3}$   $\hat{R}$   c.i.
433     393     394           0.416      (-0.217, 1.0)
435     421     423           0.148      (-0.692, 1.0)
461     420     444           0.415      (-0.368, 1.0)
487     430     439           0.532      (-0.014, 1.0)
447     378     394           0.626      (0.171, 1.0)
479     391     406           0.689      (0.304, 1.0)
493     390     391           0.703      (0.368, 1.0)
484     395     401           0.673      (0.298, 1.0)
481     408     410           0.604      (0.170, 1.0)
434     441     437          -0.074      (-1.0, 0.793)

The mean value of the 1000 bootstrap estimates of
asymmetry is 0.497. The standard error of this average
is 0.007. Although the bootstrap estimate is not significantly
biased, root mean square error is larger than before.
Here, RMS = 0.215 whereas before, i.e. for the
high asymmetry case, RMS was more than five times smaller.

The  true  value  of  the  asymmetry  fell  in  the  confidence 
interval  constructed  from  the  bootstrapped  asymmetry 
statistics  988  out  of  1000  times  (98.8%).  Seven  times 
the  lower  endpoint  was  greater  than  0.5.  Five  times  the 
upper  endpoint  was  less  than  0.5.  Thus,  the  bootstrap 
confidence  interval  is  again  conservative. 


4.3  Interval  Width 

In  Figure  3,  the  width  of  the  confidence  interval  for  each 
of  the  2000  simulated  data  sets  from  the  high  and  interme¬ 
diate  asymmetry  study  are  plotted  versus  H  ~  ||. 

This  ratio  is  a  measure  of  how  close  Ni  and  N2  are  to  one 
another.  In  general,  the  intervals  are  broadest  when  Ni 
and  N2  are  closest. 


4.4  Background  Study 

In  order  to  study  how  background  affects  the  accuracy 
of  the  bootstrap  estimate,  the  Poisson  parameters  were 
set to be $(\lambda_1, \lambda_2, \lambda_{BG}) = (100 + x, 5 + x, x)$ where x =
5, 10, 20, 50, 100, 200, 500, 1000. Asymmetry is 0.905 for
each value of x. For each set of parameters, 1000 data
sets, where background exceeds one of the other signals,
were  simulated.  A  confidence  interval  and  an  estimate  for 
asymmetry  were  computed  for  each  data  set.  In  Table  3, 
the  average  of  the  estimates  with  the  standard  deviation 
of  the  estimates  in  parentheses,  root  mean  square  error, 
coverage  fraction  and  average  length  of  the  confidence 
intervals  are  listed. 


Table 3. $(\lambda_1, \lambda_2, \lambda_{BG}) = (100 + x, 5 + x, x)$.

x       Ave. $\hat{R}$    RMS      Coverage   |c.i.|
5       0.963 (0.010)     0.059    0.854      0.108
10      0.949 (0.013)     0.046    0.974      0.142
20      0.932 (0.018)     0.033    1.000      0.184
50      0.893 (0.026)     0.029    1.000      0.272
100     0.853 (0.035)     0.062    1.000      0.359
200     0.796 (0.059)     0.124    1.000      0.478
500     0.702 (0.098)     0.225    1.000      0.678
1000    0.601 (0.145)     0.338    1.000      0.896

For $x \le 20$, the asymmetry estimate was biased high.
For larger backgrounds, the estimate was biased low. This
downward bias for high background is plausible because
the difference between $N_1$ and $N_2$, in units of standard
deviations of either one, diminishes as background increases.
As the standardized difference between signals tends to
zero, confidence in claiming that true asymmetry is close
to unity diminishes.

As background increases, both the variability of $\hat{R}$ and
the average confidence interval length increase. However,
RMS does not increase monotonically since RMS depends
on both bias and variability. The coverage of the
bootstrap confidence intervals for the lowest background
(x = 5) was less than 95%. This probably is due to both the
shortness of the intervals and the bias of the estimate. At
larger background levels, bias is greater but the confidence
intervals are longer and coverage is 100%.

4.5  Other  Examples 

For the case $(\lambda_1, \lambda_2, \lambda_3) = (600, 505, 500)$, the average
bootstrap estimate of asymmetry was 0.702 whereas
true asymmetry was 0.905 (Table 3). When the second
parameter is changed to 510, true asymmetry drops from
0.905 to 0.818 (Table 4). However, the expected value
of $\hat{R}$ was almost the same and the variability of $\hat{R}$
diminished only slightly. This is reasonable; if the difference
between $\lambda_2$ and $\lambda_{bg}$ is very slight, the expected value
of the bootstrap estimate will depend mostly on $\lambda_1$ and
$\lambda_{bg}$.
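The "true asymmetry" values quoted in this section can be checked with a few lines of code. The sketch below assumes that asymmetry is the background-corrected ratio $(\lambda_1 - \lambda_2)/(\lambda_1 + \lambda_2 - 2\lambda_{bg})$. That definition is not restated in this part of the paper, but it reproduces every true value quoted here (0.905, 0.818, 0.400, 0.200 and 0.688). The function name `true_asymmetry` is illustrative only.

```python
def true_asymmetry(lam1, lam2, lam_bg):
    """Background-corrected asymmetry (assumed definition, see lead-in)."""
    signal_sum = lam1 + lam2 - 2.0 * lam_bg
    if signal_sum <= 0:
        raise ValueError("background-corrected signals must have a positive sum")
    return (lam1 - lam2) / signal_sum


# Parameter triples (lambda_1, lambda_2, background) quoted in the text.
cases = [
    (600, 505, 500),
    (600, 510, 500),
    (573.5, 531.5, 500),
    (563, 542, 500),
    (605, 499, 475),
]
for lam1, lam2, lam_bg in cases:
    print(lam1, lam2, lam_bg, round(true_asymmetry(lam1, lam2, lam_bg), 3))
```

Holding $\lambda_1 + \lambda_2$ fixed while shrinking $\lambda_1 - \lambda_2$, as in the last two rows of Table 4, lowers the ratio without changing its denominator.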


Table 4.

   $\lambda_1$   $\lambda_2$   $\lambda_{bg}$   R       Ave. $\hat{R}$ (s.d.)   RMS
   600     505     500     0.905   0.702 (0.098)           0.225
   600     510     500     0.818   0.704 (0.087)           0.144
   573.5   531.5   500     0.400   0.473 (0.236)           0.247
   563     542     500     0.200   0.295 (0.319)           0.333

In two other examples, the first and second parameters
were both adjusted so that true asymmetry was 0.4 and
0.2, while the sum of the two parameters was held fixed.
Thus, the difference between $\lambda_2$ and $\lambda_{bg}$ increased
as the difference between $\lambda_1$ and $\lambda_2$ diminished. For
these cases, the variability of $\hat{R}$ and the RMS were greater
than for the high asymmetry example. However, bias was
less. The results are summarized in Table 4. Note that
the standard deviation of $\hat{R}$ is indicated in parentheses.
The coverage of the bootstrap confidence intervals for the
second, third and fourth examples in Table 4 was 100%,
98.4% and 95.8%, respectively.

A simulation study was also done for the example
$(\lambda_1, \lambda_2, \lambda_3) = (605, 499, 475)$. For this case, true
asymmetry is 0.688. The average of the bootstrap estimate
of asymmetry for a 1000 data set study was 0.739 (0.078),
RMS was 0.092 and coverage was 100%.

5  Conclusion 

For cases where the observed background signal was
larger than either of the other signals, the asymmetry
computed from the data is nonsensical. Using a bootstrap
approach, true asymmetry and an associated confidence
interval were estimated. In the bootstrap method, replicated
data sets satisfied a constraint that ensured physically
meaningful results. For all cases except very low
background signal cases, the bootstrap 95% confidence
intervals were conservative. For some cases, the bootstrap
estimate for asymmetry had very small prediction
error. The prediction error of the bootstrap estimate was
greatest for cases where the background was very large
relative to the other signals. The bootstrap estimate of
asymmetry, i.e. the mean of the bootstrapped asymmetry
statistics, was biased in general. The magnitude of the
bias was greatest for cases where background was very
high and asymmetry was close to unity (0.905). For
other high background cases where asymmetry was less
extreme, bias was less but RMS was larger.

Figure 1. Bootstrap replications of asymmetry statistic
for data set (220,59,65)


Figure 2. Bootstrap replications of asymmetry statistic
for data set (433,393,394)


Figure 3. Length of bootstrap confidence intervals.


Abstract

We consider the problem of constructing confidence intervals
for possibly “messy” functions of a multinomial
parameter. The number of categories can be large and
the sample size small, meaning that the problem of
sparseness must be confronted. Thus, standard asymptotics
based on the delta method will often prove unsatisfactory.
Alternatives to the delta method include:
(1) Madansky’s method, based on constrained maximum
likelihood; (2) the bootstrap; and (3) intervals derived
from the brute force (Monte Carlo) calculation of exact
confidence regions. These approaches are discussed and
contrasted in the context of an empirical problem.

1  Introduction 

Let $\pi$ denote the vector parameter of a multinomial
distribution, and let $\theta = \theta(\pi)$ be a “smooth” (i.e.,
differentiable) scalar-valued function of $\pi$. Suppose that
a random sample of size $N$ is taken from Multin($\pi$),
from which we wish to construct a confidence interval
for $\theta$. We can approach this problem in a variety of
ways, ranging from computationally intensive “exact”
methods, to the bootstrap, to less computationally
intensive but approximate methods based on asymptotic
arguments. But how workable are these methods in a
particular instance where both the dimensionality of $\pi$
and $N$ are large, but $N$ is not large enough to justify
faith in the validity of asymptotic approximations?

This paper is a summary of ongoing research motivated
by a problem arising from estimating the degree
of concentration of an economic market.* An important
concern of the Antitrust Division of the U.S. Department
of Justice is to maintain competitive economic markets.
Markets that are the least concentrated (that is, those
that are not dominated by a small number of firms) tend
to be the most responsive, other things being equal, to
the discipline of competition. Over time, a given market
will become more concentrated if existing firms fail or
exit the picture; if a few highly successful competitors
gain larger market shares; or if mergers occur among
existing firms. The first two of these often arise from
market forces themselves, and are seldom amenable to
or appropriate for regulatory control. Mergers, however,
can be and often are contested by the Antitrust Division
in the interest of keeping markets competitive.

*This paper does not purport to represent the policy or views
of the U.S. Department of Justice.

The decision to contest a merger depends on a complex
analysis, including the level of concentration in the
market both before and after the merger. While there
is no single right way of measuring concentration, the
Herfindahl Index has emerged as a favorite in much of the
microeconomic literature, and in the Antitrust Division
since 1982. If a market has $K$ firms, with market shares
$\pi_1, \ldots, \pi_K$, where $\pi_k > 0$ and $\sum_{k=1}^{K} \pi_k = 1$, the
Herfindahl Index is defined by $H = 10{,}000 \times \sum_{k=1}^{K} \pi_k^2$. It is
easy to see that $H$ assumes its smallest possible value of
$10{,}000/K$ when the market is least concentrated (equal
market shares for all firms), its largest possible value of
10,000 when it is most concentrated (monopoly), and
is increased if two or more of the firms merge. One of
the great virtues of the Herfindahl Index is its simplicity:
even nonmathematically astute judges, lawyers, and
jurors can understand it, and multiplication by 10,000
eliminates the need for fractions.
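The three properties just listed (minimum at equal shares, maximum at monopoly, increase under merger) are easy to confirm numerically. The following is a minimal sketch; the function names `herfindahl` and `merge` are illustrative, and the example shares are the ten firm shares used in the simulations of Section 3.

```python
def herfindahl(shares, scale=10_000):
    """Herfindahl Index: scale * sum of squared market shares."""
    if any(s <= 0 for s in shares) or abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must be positive and sum to 1")
    return scale * sum(s * s for s in shares)


def merge(shares, i, j):
    """Return the share vector after firms i and j (0-based) merge."""
    merged = [s for k, s in enumerate(shares) if k not in (i, j)]
    merged.append(shares[i] + shares[j])
    return merged


shares = [.20, .20, .20, .15, .10, .05, .03, .03, .03, .01]
before = herfindahl(shares)
after = herfindahl(merge(shares, 0, 1))
print(round(before), round(after), round(after - before))

print(round(herfindahl([0.1] * 10)))   # equal shares: 10,000 / K
print(round(herfindahl([1.0])))        # monopoly: 10,000
```

The increase from merging firms $i$ and $j$ is $10{,}000 \times 2\pi_i\pi_j$, since $(\pi_i + \pi_j)^2 - \pi_i^2 - \pi_j^2 = 2\pi_i\pi_j$.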

2  Estimation 

Occasionally, Herfindahls are estimated when market
shares are imputed from a “random sample” of consumers.
The implied model is that a randomly sampled
consumer responds in favor of firm $k$ with probability
equal to its market share. Suppose that $N$ consumers
are sampled, and that $\hat{\pi}_k$ are the sampled proportions of
the firms. The naive estimate of $H$ is then $\hat{H} = \sum_{k=1}^{K} \hat{\pi}_k^2$
(we will from here on drop multiplication by 10,000).
The mean and variance of $\hat{H}$ are easily found to be

$E(\hat{H}) = \frac{N-1}{N} H + \frac{1}{N}, \qquad \mathrm{Var}(\hat{H}) = \frac{A_1}{N} + \frac{A_2}{N^2} + \frac{A_3}{N^3},$  (2.1)

where $A_1 = 4(T - H^2)$, $A_2 = 6H + 10H^2 - 16T$, $A_3 =
6(2T - H - H^2)$, and $T = \sum_{k=1}^{K} \pi_k^3$. Using the fact that
$H^2 \le T \le H$ gives

$\mathrm{Var}(\hat{H}) \le H(1 - H)\,\tau_N,$  (2.2)

where $\tau_N = 4N^{-1} + 6N^{-2} + 6N^{-3}$. Appealing to the
asymptotic normality of $\hat{H}$, an asymptotically conservative
$100(1 - 2\alpha)\%$ confidence interval for $H$ can be
derived as follows:


9a,N  ±  ^ 

h.N- 

.4(H-i)’ 

(^)*  +  4rAf] 

2 

where  9a, N  =  2  (~/^)  +  (see  Bickel  and 

Doksum [1977], p. 160). Replacing $z_\alpha$ by a Chebyshev
inequality bound (i.e., $\sqrt{20}$ for 1.96 at $\alpha = .025$) gives
a confidence interval with guaranteed coverage for every
$N$.

Suppose that firms 1 and 2 propose to merge. This
would increase the Herfindahl index by an amount equal
to $\Delta H = 2\pi_1\pi_2$. Using arguments similar to those
above, exact expressions for the mean and variance
of $\Delta\hat{H}$ can be given, and a confidence interval similar
to (2.3) can be derived. In particular, $E(\Delta\hat{H}) =
\left(\frac{N-1}{N}\right)\Delta H$, from which it is seen that the estimated
Herfindahl tends to be biased upward, but that the
estimated change due to a merger tends to be biased
downward.
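For a small multinomial, the two bias statements and the bound (2.2) can be verified exactly by enumerating every possible sample. The sketch below is illustrative only: the helper names and the toy values ($K = 3$, $N = 6$, shares .5, .3, .2) are assumptions, not taken from the paper. It checks the identity $E(\hat{H}) = \frac{N-1}{N}H + \frac{1}{N}$, which follows from $E(\hat{\pi}_k^2) = \pi_k^2 + \pi_k(1 - \pi_k)/N$, the identity $E(\Delta\hat{H}) = \frac{N-1}{N}\Delta H$, and the inequality $\mathrm{Var}(\hat{H}) \le H(1 - H)\tau_N$.

```python
from itertools import product
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Exact multinomial probability of a vector of cell counts."""
    coef = factorial(sum(counts))
    for n in counts:
        coef //= factorial(n)
    return coef * prod(p ** n for p, n in zip(probs, counts))


def exact_moments(probs, N):
    """Exact E(H_hat), Var(H_hat) and E(dH_hat) for a merger of firms 1 and 2."""
    e_h = e_h2 = e_dh = 0.0
    for counts in product(range(N + 1), repeat=len(probs)):
        if sum(counts) != N:
            continue
        w = multinomial_pmf(counts, probs)
        p_hat = [n / N for n in counts]
        h_hat = sum(p * p for p in p_hat)
        e_h += w * h_hat
        e_h2 += w * h_hat ** 2
        e_dh += w * 2 * p_hat[0] * p_hat[1]
    return e_h, e_h2 - e_h ** 2, e_dh


probs, N = [0.5, 0.3, 0.2], 6
H = sum(p * p for p in probs)
dH = 2 * probs[0] * probs[1]
tau = 4 / N + 6 / N ** 2 + 6 / N ** 3

mean_h, var_h, mean_dh = exact_moments(probs, N)
print(mean_h, (N - 1) / N * H + 1 / N)   # upward bias of H_hat
print(mean_dh, (N - 1) / N * dH)         # downward bias of dH_hat
print(var_h, H * (1 - H) * tau)          # variance and its bound (2.2)
```

Enumeration grows quickly with $K$ and $N$, so this is only a check on the algebra, not a practical tool for the $K = 10$, $N = 400$ problem considered later.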

Usually, the market shares $\{\pi_k\}$ are regarded as known
quantities, and $H$ is not the subject of statistical inference.
Even with this knowledge, however, the issue of
inference may arise if a more detailed analysis is desired.
Suppose that the universe of consumers is partitioned
into $J$ strata, with $\pi_{kj}$ representing the share of the
market belonging to firm $k$ and stratum $j$, and $\pi_{k\cdot}$ and
$\pi_{\cdot j}$ denoting marginalized market shares. A “weighted”
Herfindahl Index is given by

$H^{(w)} = \sum_{j=1}^{J} \pi_{\cdot j} H_j, \qquad H_j = \sum_{k=1}^{K} \left(\pi_{kj}/\pi_{\cdot j}\right)^2,$  (2.4)

which is simply a convex combination of the stratum-specific
Herfindahls. As before, let $\hat{\pi}_{kj}$ denote the estimated
cell shares based on a random sample of $N$ consumers.
We will assume that the marginal firm shares
$P_k = \pi_{k\cdot}$ are known, giving $\tilde{\pi}_{kj} = \hat{\pi}_{kj} P_k / \hat{\pi}_{k\cdot}$ as the
MLE of $\pi_{kj}$. Take $\tilde{\pi}_{\cdot j} = \sum_k \tilde{\pi}_{kj}$. Substituting these
estimates into (2.4) gives an estimated weighted Herfindahl
that we will denote $\hat{H}^{(w)}$, with $\Delta\hat{H}^{(w)}$ obtained in a
similar manner.
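The constrained estimate can be sketched in code as follows. This is a hedged illustration rather than the authors' implementation: it assumes the weighted index is the combination of stratum-specific Herfindahls with weights equal to the stratum shares, and it rescales each firm's row of the cell-share table so that its total matches the known firm share $P_k$. The function name and argument layout are hypothetical.

```python
def weighted_herfindahl(cell, firm_shares=None):
    """Weighted Herfindahl from a K x J table of cell shares.

    cell[k][j] is the share of firm k in stratum j (entries sum to 1).
    If firm_shares is given, each row is rescaled so that its total
    equals the known marginal share of that firm.
    Assumes stratum weights equal the stratum shares (see lead-in).
    """
    K, J = len(cell), len(cell[0])
    if firm_shares is not None:
        adjusted = []
        for k in range(K):
            row_total = sum(cell[k])
            if row_total == 0:
                raise ValueError("a firm with no sampled consumers cannot be rescaled")
            adjusted.append([cell[k][j] * firm_shares[k] / row_total for j in range(J)])
        cell = adjusted
    h_w = 0.0
    for j in range(J):
        col_total = sum(cell[k][j] for k in range(K))
        if col_total == 0:
            continue  # an empty stratum contributes nothing
        h_j = sum((cell[k][j] / col_total) ** 2 for k in range(K))
        h_w += col_total * h_j
    return h_w


# Independence between firm and stratum, five equally likely strata.
P = [.20, .20, .20, .15, .10, .05, .03, .03, .03, .01]
table = [[p * 0.2 for _ in range(5)] for p in P]
print(round(weighted_herfindahl(table, firm_shares=P), 4))
```

Under independence every stratum has the same Herfindahl, so the weighted index equals the unweighted value .1578 quoted as the truth in Section 3.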

In one particular problem, $K = 10$, $J = 5$, $N = 400$,
and many cells in the $\hat{\pi}_{kj}$ table were empty. Some of
these zeros were surely structural, but others were plausibly
induced by sampling. It seems intuitive that the
weighted estimates should exhibit greater bias than their
unweighted counterparts, and simulations have borne
this out. One might ask whether taking advantage of
knowing the firm shares $P_k$ is more trouble than it is
worth: for instance, an estimate of $H^{(w)}$ taking the form


is unbiased conditioned on $N\hat{\pi}_{\cdot j} > 1$, with $\hat{H}_j$ denoting
the estimated Herfindahl for stratum $j$ using the observed
proportions. We used the constrained estimates
because they were proposed by the parties contemplating
the merger, who could argue that the unconstrained
estimates failed to take advantage of known information,
leading to discrepancy measures that unfairly worked
against them.


3  Confidence  Intervals 

Between the blunt edges of asymptotics and brute
computing force lie a variety of methods for deriving
confidence intervals. We discuss several in
the context of estimating Herfindahl indices. Our
conclusions are based on simulation exercises taking
$K = 10$, $J = 5$, and $N = 400$; $(P_1, \ldots, P_{10}) =
(.20, .20, .20, .15, .10, .05, .03, .03, .03, .01)$; equal stratum
probabilities of .2; and independence between stratum
and firm. Figure 1 shows histograms of 1000 simulated
values of $\hat{H}^{(w)}$ (truth = .1578) and $\Delta\hat{H}^{(w)}$ (truth
= .08).


Figure 1: Simulation Histograms


3.1 The Delta Method

Both $\hat{H}^{(w)}$ and $\Delta\hat{H}^{(w)}$ are consistent and asymptotically
normal estimates, because they are “regular” functions
of the observed cell proportions. Thus, asymptotic standard
errors can be derived by evaluating the gradients
of these functions at the observed proportions, and using
the facts that $\mathrm{Var}(\hat{\pi}_{kj}) = N^{-1}\pi_{kj}(1 - \pi_{kj})$ and
$\mathrm{Cov}(\hat{\pi}_{kj}, \hat{\pi}_{rs}) = -N^{-1}\pi_{kj}\pi_{rs}$. This is tedious and
ultimately not satisfying, because (1) the standard error
estimates are poor; (2) the estimates are substantially
biased; and (3) the estimates have sampling distributions
that are not very normal-like. It is interesting to
note that all of the simulated values of $\hat{H}^{(w)}$ exceeded
the truth, while $\Delta\hat{H}^{(w)}$ exhibited less bias. These results
are summarized below:

                                 Bias                 SE
                            Truth    Mean     Simulation   Theory
   $\hat{H}^{(w)}$          .1578    .1666    .0025        .0009
   $\Delta\hat{H}^{(w)}$    .0800    .0793    .0017        .0023

The “simulation SE” is the standard deviation of the
simulated values. The “theory SE” refers to the average,
across simulations, of the estimated standard error
obtained from the delta method.

It is often worthwhile to apply a variance-stabilizing
transformation, if known, when deriving confidence
intervals. This cannot be done exactly in the present
context, due to the presence of “nuisance parameters.” In
the case of $\hat{H}$, (2.2) suggests that an arcsine transformation
may come close to doing the job, and it may be a
good first guess for $\hat{H}^{(w)}$ as well.

3.2 Confidence Regions

If $C_\alpha$ is a $100(1 - 2\alpha)\%$ confidence region for the
multinomial parameter $\pi$, the minimum and maximum
values of $\theta(\cdot)$ evaluated over the region can be called a
$100(1 - 2\alpha)\%$ confidence interval for $\theta(\pi)$. There are
many ways to choose $C_\alpha$, and not all of them will produce
good (narrow) confidence intervals for $\theta(\pi)$. In fact,
the best confidence regions will not generally produce the
best confidence intervals. Intuition suggests that desirable
confidence regions will follow the contours of $\theta(\cdot)$ as
closely as possible, and will concentrate as much of their
mass as possible between them.


3.3  Constrained  MLE 

A likelihood-based confidence interval can be obtained
for $\theta(\pi)$ in the following manner. Let

$L(\pi; X) = \prod_{k=1}^{K} \prod_{j=1}^{J} \pi_{kj}^{x_{kj}},$  (3.1)

where $X = [x_{kj}]$ is the matrix of observed cell frequencies.
Let $S_{K,J}$ denote the region of the unit simplex in
$R^{KJ}$ that has marginal firm shares equal to the known
quantities. Fix a value $\theta_0$, and let $S_{K,J}(\theta_0)$ denote the
subset of $S_{K,J}$ for which $\theta(\pi) = \theta_0$. The MLE of $\pi$ over
$S_{K,J}$ and $S_{K,J}(\theta_0)$ will be denoted $\tilde{\pi}$ and $\tilde{\pi}(\theta_0)$
respectively.

Consider a test of the hypothesis $\mathcal{H}(\theta_0) : \theta(\pi) = \theta_0$
based on the statistic

$R(\theta_0; X) = \frac{L(\tilde{\pi}; X)}{L(\tilde{\pi}(\theta_0); X)}.$  (3.2)

For $\pi \in S_{K,J}(\theta_0)$, $2 \log R(\theta_0; X)$ is asymptotically
distributed as chi-square with one degree of freedom. This
can be used to test $\mathcal{H}(\theta_0)$, with the set of $\theta_0$ for which
$\mathcal{H}(\theta_0)$ is accepted giving an asymptotically valid confidence
interval for $\theta(\pi)$. Recently, Owen (1990) has extended
this classical idea to a nonparametric context.

One difficulty with this approach is the computation
of the constrained MLE $\tilde{\pi}(\theta_0)$. Treating it as a Lagrange
multiplier problem requires the simultaneous solution of
a large set of nonlinear equations, a numerically difficult
problem for which no method is guaranteed safe
and sure. Projected gradient methods may offer the best
hope, provided that good starting values are available,
which is often the case. We have used projected gradients
with success only in lower dimensional problems.

If $g$ is a 1-1 function over the range of $\theta$, applying
$g^{-1}$ to a confidence interval obtained for $g \circ \theta(\pi)$ gives
a confidence interval for $\theta(\pi)$. Sometimes working with
$g \circ \theta$ offers computational advantages over working with
$\theta$. One particular choice for $g$ that often seems to work
well is the Lagrange multiplier attached to the constraint
$\theta(\pi) = \theta_0$, an idea which is due to Madansky (1965). For
example, for $\theta = H$ (unweighted Herfindahl), it can be
shown that $\theta_0$ and the Lagrange multiplier $\lambda$ are in a 1-1
decreasing relationship, and that solving the constrained
“normal equations” reduces to finding the fixed point of
a contraction mapping when $\lambda$ is positive.

The reliance on asymptotics can be obviated if the exact
finite sample distribution of $R(\theta_0; X)$ is used. This
is usually not feasible analytically, but it can be done
via simulation within the context of a projected gradient
problem. We have not tried this, so the extent to
which it pays off to expend the additional effort is not
clear.

3.4  The  Bootstrap 

An advantage of using the bootstrap to construct confidence
intervals is its ease of implementation, which in
its simplest form is basically the same for all problems.
We have used the bootstrap to construct confidence intervals
for $H$, $\Delta H$, and $\Delta H^{(w)}$. Our conclusion
is that the simplest use of the bootstrap produces disappointing
results, but that it offers a fertile area for
experimentation.

The “simple” bootstrap proceeds by taking $B$ random
samples from a Multin($\hat{\pi}$) distribution, computing estimates
of the desired quantity from the $B$ bootstrap samples,
and choosing quantiles of the bootstrap estimates
corresponding to the desired confidence level. This technique
can produce good confidence intervals if certain
conditions are satisfied. One condition, that the estimates
be asymptotically normal and consistent, is in
our view not too severe. The estimates should also exhibit
little or no finite sample bias, and the effect of
not knowing the values of any nuisance parameters that
may be present should be negligible. These last two
conditions are more troublesome, and embellishments
to the simple bootstrap have been made to deal with
them: various bias-correction schemes and pivoting,
respectively. They, and bootstrapping in general, are
discussed in Efron (1982).
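The simple bootstrap described above can be sketched in a few lines for the unweighted estimate $\hat{H}$. This is a minimal illustration, not the authors' code: the function names, the fixed seed, the particular quantile rule, and the example counts (the expected cell counts for $N = 400$ under the firm shares of Section 3) are all assumptions made for the example.

```python
import random


def herfindahl_hat(counts):
    """Naive estimate: sum of squared sample proportions."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)


def simple_bootstrap_interval(counts, B=1000, level=0.95, seed=0):
    """Percentile interval from B resamples of Multin(pi_hat); no bias correction."""
    rng = random.Random(seed)
    n = sum(counts)
    categories = range(len(counts))
    estimates = []
    for _ in range(B):
        # Drawing n categories with weights proportional to the observed
        # counts is a draw from Multin(pi_hat).
        draw = rng.choices(categories, weights=counts, k=n)
        boot_counts = [0] * len(counts)
        for k in draw:
            boot_counts[k] += 1
        estimates.append(herfindahl_hat(boot_counts))
    estimates.sort()
    alpha = (1 - level) / 2
    lower = estimates[int(alpha * (B - 1))]
    upper = estimates[int(round((1 - alpha) * (B - 1)))]
    return lower, upper


counts = [80, 80, 80, 60, 40, 20, 12, 12, 12, 4]
print(herfindahl_hat(counts), simple_bootstrap_interval(counts))
```

A cell that is empty in the observed sample has weight zero and therefore stays empty in every resample, which is one concrete form of the sparseness problem raised later in this section.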

Our best results were obtained for $\Delta H$ using $\Delta\hat{H}$, with
$B = 1000$, no bias correction, and no pivoting. Based on
250 replications of the model described above, the estimated
lower and upper tail violation probabilities for a
95% confidence interval were 1.2% and 2.8% respectively.
Other quantities fared much worse, with much larger-than-nominal
violation probabilities, and badly unbalanced
intervals. Bias correction is obviously needed, but
the usual quantile-adjustment methods have not worked
well because the bias tends to be of a much larger order
than the standard error.


One  problem  with  bootstrapping  large-order  multino¬ 
mials  is  sparseness:  the  bootstrap  samples  contain  at 
least  as  many  empty  cells  as  the  root  sample,  and  of¬ 
ten  more.  Thus,  if  the  quantity  of  interest  is  sensitive 
to  sparseness,  the  bootstrap  may  produce  disappoint¬ 
ing  results.  One  outcome  of  this  that  we  have  observed 
is  that  the  variance  within  bootstrap  samples  can  be 
much  smaller  than  the  variance  between,  which  bodes  ill 
for  staying  true  to  nominal  coverage  levels.  A  possible 
corrective  measure  would  be  to  “smooth  out”  sparse¬ 
ness  before  bootstrapping,  using  either  a  Bayesian  or  a 
non-Bayesian  argument.  None  of  these  problems  detract 
from  the  asymptotic  validity  of  the  bootstrap,  but  they 
do  underscore  the  need  to  carefully  study  its  small  sam¬ 
ple  behavior. 

4  Discussion 

While  producing  good  standard  error  estimates  is  usu¬ 
ally  easier  than  producing  good  confidence  intervals, 
many  of  us  prefer  the  latter.  In  the  context  of  estimating 
Herfindahls,  we  felt  that  it  was  important  to  show  that  a 
variety  of  states  of  nature  could  plausibly  explain  a  given 
set  of  sample  results.  However,  even  with  a  full  menu  of 
options  to  choose  from,  constructing  good  small-sample 
confidence  intervals  is  not  an  easy  problem,  and  there  is 
much room for further experimentation.


ABSTRACT. Adding “systematic noise” to the step
term of the Newton-Raphson (NR) root finding algorithm
permits “expected q-linear convergence” and convergence
almost surely to the root for a larger class of
functions and larger starting sets than those for which
NR converges “deterministically.” These results have
application not only to a wide range of optimization problems
but also to understanding the behavioral repertory
of animals undertaking pheromone induced search. It is
shown that the “search” reduces in many cases to finding
the root of a function of two or three dimensions. In
cases (as in the search of the gypsy moth for its mate)
where the animal cannot simply travel in the direction
of increasing signal (scent), randomized NR gives insight
into the search behavior required to discover the signal
source.

1. INTRODUCTION. The numerical determination
of a global maximum or minimum of a function
$g : R^m \to R^n$, where $R^t$ represents Euclidean $t$-space, is
commonly accomplished through an iterative algorithm
$x_k = H_{k-1}(x_{k-1}, \ldots, x_0)$, $k = 1, 2, \ldots$, $x_0 \in E \subset R^m$,
where $E$ is the set of initial solution estimates and $\{x_k\}$,
$\{H_k\}$ represent, respectively, a sequence of solution estimates
(which we call the path) and operators on the
estimates. It has long been recognized that numerical algorithms
are subject to “unacceptable convergence;” i.e.,
non-convergence to a solution or convergence too slow to
yield practical results. Paths in two or more dimensions
are particularly subject to traps, to being caught in ridges
or to cycling. Joseph, et al. (1990) present examples
where this type of non-convergence occurs for one-dimensional
paths as well. To ameliorate these problems, the
authors introduced randomized Newton-Raphson (RNR)
in which a random element is injected into the Newton-Raphson
algorithm in order to allow cycles to be broken
or to permit large jumps along the path towards the solution,
thereby increasing the speed of convergence. (See
Joseph, et al. (1991) for an application in two dimensions
to a problem in seismic exploration.) We examine RNR
in Sec. 2 and in Sec. 3 introduce some new results related
to it.

Of additional interest here is that RNR serves as a
source for conceptual models of animal search, especially
where chemical systems provide the dominant means of
communication. This matter is discussed in Secs. 2 and
3 in association with the analytical issues raised there.
Applications are discussed in Sec. 4.

2. RANDOMIZED NEWTON-RAPHSON
(RNR): The Newton-Raphson (NR) algorithm for $g :
R \to R$, $W$ a compact interval in $R$, $g'$ the derivative of
$g$, $E$ the set of initial values, $x_0 \in E \subset W$, is

$x_{k+1} = x_k - y_k, \quad x_k \in W,$
$y_k = g(x_k)/g'(x_k)$ (step term)   (2.1)

with some stopping rule. NR is a root finding algorithm
where $\{x_k\}$ is a sequence of iterates which, under certain
conditions, converges to the root $p$ of $g$. We use NR as
a paradigm because of its optimality properties (Ortega
and Rheinboldt (1970)). Furthermore, its use permits
a concise presentation while at the same time making
clear the methods by which the results may be extended
to other root finding algorithms.
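The deterministic iteration can be sketched as follows. This is a minimal illustration under stated assumptions: the stopping rule ($|g(x)|$ below a tolerance, or a cap on the number of iterations) is one common choice rather than the paper's, the restriction to a compact interval $W$ is omitted for brevity, and the test function $g(x) = 2x - x^2$ with starting value .75 is borrowed from Example 1 of Sec. 3.

```python
def newton_raphson(g, g_prime, x0, tol=1e-10, max_iter=50):
    """Newton-Raphson root finding: x_{k+1} = x_k - g(x_k) / g'(x_k)."""
    x = x0
    for k in range(max_iter):
        if abs(g(x)) < tol:
            return x, k
        slope = g_prime(x)
        if slope == 0:
            raise ZeroDivisionError("g'(x) vanished; the step term is undefined")
        x = x - g(x) / slope  # subtract the step term y_k
    return x, max_iter


root, iterations = newton_raphson(
    lambda x: 2 * x - x * x,   # g
    lambda x: 2 - 2 * x,       # g'
    0.75,
)
print(root, iterations)
```

For this $g$ the update simplifies to $x_{k+1} = -x_k^2/(2 - 2x_k)$, so the first step overshoots from .75 to $-1.125$ before the iterates contract quadratically toward the root at 0.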

In “animal search” we assume that the prey (or potential
mate) emits a signal (such as a gas) which produces
a continuous spatial distribution having a unique
maximum $h(p)$ at the location $p$ of the source at any
given time. We also assume that $h$ has no local minima.
(We will subsequently weaken these assumptions.)
Under these conditions, there are many ways of changing
the animal search problem into a root finding problem.
For example, we can assume the animal “knows”
the threshold value of $h(p)$. Defining $g(x) = h(x) - h(p)$,
the problem becomes one of finding the unique root of
$g(x)$. Another method of changing the problem to root
finding is to let $g(x) = h'(x)$. Then the position $p$ of the
source can be found by solving $g(x) = 0$. We shall assume
for the moment that the “maximization” problem
of the animal can be transformed into finding the root $p$
of a function. We call the location $p$ the target of the
algorithm.

In Sec. 4, we shall motivate the introduction of RNR
into the issue of animal search, but for the moment we
present RNR as capable of resolving areas of “unacceptable
convergence” as described in Sec. 1.

To define RNR in one dimension, the step term
in (2.1) is now a random variable $Y_k = (g(x_k) +
Z_{1k})/(g'(x_k) + Z_{2k})$, where $Z_{1k}$ is a random variable having
a density and zero expectation; $Z_{2k}$ is either an independent,
continuous random variable with the same sign
as $g'(x_k)$ or is a constant with the same sign as $g'(x_k)$.
The RNR algorithm is

$X_{k+1} = x_k - Y_k$ when $x_k \in W$, where $x_k$ is a realization of $X_k$,
$X_{k+1} = x_{k-1}$ when $x_k \notin W$ (re-set).   (2.2)

The stopping rule for the algorithm is: for a specified
$\varepsilon$, and $S_\varepsilon = \{x : |g(x)| < \varepsilon\} \subset W$, stop when $X_{k+1} =
x_{k+1} \in S_\varepsilon$. (Note, the re-set condition in (2.2) can be
replaced by the re-start condition: $X_{k+1}$ is uniform over
$W$ when $x_k \notin W$.) Joseph, et al. (1990) obtained results
for the convergence (in a probabilistic sense) of RNR to
the root. We do not repeat them here since we develop
stronger results in the next section.

Animal  search  goes  on  in  more  than  one-dimension. 
The  generalization  of  (2.2)  to  higher  dimensions  can  be 
found  in  Joseph,  et  al  (1990).  For  conceptual  purposes, 
our  discussion  mainly  will  be  in  one-dimension  but  the 
results  are  easily  generalized  to  higher  dimensions. 

The  algorithm  (2.2)  permits  steps  of  arbitrary  size, 
steps  which  can  overshoot  the  target  by  a  greater  amount 
than  is  permissible  in  any  animal  tracking  problem.  To 
avoid  unrealistic  step  sizes,  in  all  that  follows,  we  fix  Z2k 
as  a  constant,  the  size  of  which  depends  on  each  partic¬ 
ular  tracking  problem  and  the  time  frame  permitted  for 
each  iteration. 

It is important to emphasize that the random variables
$\{Z_{1k}\}$ are injected into the system by the algorithm.
These “noise” terms are not introduced externally as is
done with “Robbins-Monro,” which attempts to extract
the signal $g(x)$ from the noise. In RNR the purpose is
to add “noise” in a controlled way to the NR algorithm
to obtain convergence in some cases where NR does not
yield acceptable convergence.

3. STRONG CONVERGENCE. We present here
some new results on the convergence of RNR to a unique
root which are of interest both in numerical computation
generally and for the issue of animal search. The question
of what conditions on the density of the random variable
$Z_{1k}$ are required to ensure convergence is treated in
Th. 1. What happens to the search if these conditions are
(mildly) not met is treated in Ex. 1.

Lemma 1. (Dinwoodie). Let $\{X_k\}$ be a sequence of
random variables such that $E(|X_{k+1}|) \le cE(|X_k|)$ for all
$k$ and for some positive $c < 1$. Then $X_k \to 0$, a.s.

Proof: Clearly, $\sum_{k=0}^{\infty} E|X_k| < \infty$. Since, for every $\varepsilon > 0$
and $k$, $\varepsilon P(|X_k| > \varepsilon) \le E(|X_k|)$, so $\sum_k P(|X_k| > \varepsilon) <
\infty$ as well. By the Borel-Cantelli lemma, for every $\varepsilon >
0$, $P(|X_k| > \varepsilon, \text{i.o.}) = 0$. The conclusion follows using
Chung (1974), p. 73, Th. 4.2.2.
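The first step of the proof, which is stated as "clearly," can be written out as a short worked derivation. The sketch assumes $E|X_0| < \infty$, which the hypothesis needs in order to have content:

```latex
\begin{align*}
E|X_k| &\le c\,E|X_{k-1}| \le \cdots \le c^{k}\,E|X_0|, \\
\sum_{k=0}^{\infty} E|X_k| &\le E|X_0| \sum_{k=0}^{\infty} c^{k}
   = \frac{E|X_0|}{1-c} < \infty, \\
\sum_{k=0}^{\infty} P(|X_k| > \varepsilon)
   &\le \frac{1}{\varepsilon} \sum_{k=0}^{\infty} E|X_k| < \infty
   \qquad \text{(Markov's inequality, for each } \varepsilon > 0\text{)}.
\end{align*}
```

Summability of the tail probabilities is exactly the hypothesis of the first Borel-Cantelli lemma, which gives $P(|X_k| > \varepsilon \text{ i.o.}) = 0$ for every $\varepsilon > 0$ and hence almost sure convergence to 0.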


Theorem 1. Suppose $\{X_k\}$ is a Markov process in a
compact interval $W$. Let $p$ represent the target and
$\varphi(d) = E(|X_{k+1} - p| \mid X_k - p = d)$ for all $d \in W - \{p\}$.
Suppose: (a) for all $d$ in a neighborhood of 0, $\varphi(d) \le c|d|$
for some positive $c < 1$; (b) the conditional density $f$ of
$X_{k+1}$ is bounded below as follows: $f_{X_{k+1}|X_k}(y \mid w) \ge b(w)$
for all $y, w \in R$, where the bound $b$ is a continuous function
of $w$, positive except possibly for $w = p$. Then
$X_k \to p$ a.s.

Proof: By conditions (a) and (b), there are values $r$,
$a > 0$ such that (i) for $|d| < r$, $E(|X_{k+1} - p| \mid X_k - p =
d) \le c|d|$ and (ii) for $|d| \ge r$, $f_{X_{k+1}|X_k}(y \mid d + p) \ge a$ for
all $y \in R$. For each $k$, let $Y_k = |X_k - p| \wedge r$. Then $\{Y_k\}$ is
a sequence of non-negative, uniformly bounded random
variables. For $0 < t < r$, using (i), $E(Y_{k+1} \mid Y_k = t) \le ct$.
When $t = r$, using (ii), $E(Y_{k+1} \mid Y_k = t) \le c'r$ for
some positive constant $c' < 1$. Consequently, $E(Y_{k+1}) \le
c''E(Y_k)$ for some positive constant $c'' < 1$. The result
follows from Lemma 1.

This result suggests that the design of the density of
the injected random variable $Z_{1k}$ in the algorithm (2.2) is
crucial to ensuring almost sure convergence to the target.
Specifically, this theorem demands that the concentration
of the density of $X_k$ about the target $p$ increases
“rapidly” as $X_k$ approaches $p$. The theorem gives an
indication of how rapidly this concentration must take
place. Moreover, the density of $Z_{1k}$ must be conditionally
bounded below (condition (b) of the theorem). This
condition ensures there are no other points of concentration.
It can be relaxed, if condition (a) is correspondingly
changed.


Example 1. Suppose we seek the root of $g(x) = 2x - x^2$.
Using RNR, $X_{k+1} = x_k - (2x_k - x_k^2 + Z_{1k})/(2 - 2x_k + Z_{2k})$,
where $Z_{1k}$, $Z_{2k}$ are independent random variables;
$E(Z_{1k}) = 0$, $Z_{2k} > 0$, a.s. ($Z_{2k}$ could be a positive
constant.)

(a) Observe that $E(|X_{k+1}| \mid x_k) \le [2 - 2x_k]^{-1}[x_k^2 +
|x_k|E(Z_{2k}) + E(|Z_{1k}|)]$. For $|x_k|$ small, if we had designed
the density of $Z_{1k}$ so that $E(|Z_{1k}| \mid x_k) \le .9|x_k|$ and
the density of $Z_{2k}$ so that $E(Z_{2k} \mid x_k) \le .9$ (or $Z_{2k}$ is a
constant less than .9), then $E(|X_{k+1}| \mid x_k) \le .95|x_k|$ ($|x_k|$
small). From our results, we expect a.s. convergence. We
performed a simple simulation on a hand calculator using
$Z_{2k} \sim U(0, 1.8)$, $Z_{1k} \sim U(-1.8|x_k|, 1.8|x_k|)$. Stopping
when  |3(x)|  <  10“  ',  and  using  Xq  =  .75,  we  arrive  quickly 
at  xe  =  3  X  10~®. 
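
The simulation in part (a) is easy to repeat. The following is a minimal sketch, not the authors' calculator procedure: the function names, the tolerance `tol`, the iteration cap and the seeded generator are assumptions made for illustration; only the update rule and the two uniform noise densities are taken from the example.

```python
import random


def g(x):
    """Function of Example 1; the root sought is x = 0."""
    return 2.0 * x - x * x


def rnr_step(x, rng):
    """One randomized Newton-Raphson (RNR) step with the noise design of part (a)."""
    z1 = rng.uniform(-1.8 * abs(x), 1.8 * abs(x))  # E(Z1) = 0, E|Z1| = 0.9|x|
    z2 = rng.uniform(0.0, 1.8)                      # Z2 >= 0, E(Z2) = 0.9
    return x - (g(x) + z1) / (2.0 - 2.0 * x + z2)


def rnr_path(x0, tol=1e-6, max_iter=10_000, seed=0):
    """Iterate until |g(x)| < tol or max_iter steps; return all visited points."""
    rng = random.Random(seed)  # seeded only so that a run can be repeated
    path = [x0]
    while abs(g(path[-1])) >= tol and len(path) <= max_iter:
        path.append(rnr_step(path[-1], rng))
    return path
```

Calling `rnr_path(0.75)` uses the starting point quoted in the text; the particular path, and the number of steps taken, depend on the seed.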

(b) Note that

$$E(X_{k+1}^2 \mid x_k) = x_k^2\, E\!\left( \frac{Z_{1k}^2\, x_k^{-2} + Z_{2k}(Z_{2k} - 2x_k) + x_k^2}{(2 - 2x_k + Z_{2k})^2} \right). \qquad (3.1)$$

If $0 < Z_{2k} < 1$, and we design the density
of $Z_{1k}$ so that $V(Z_{1k} \mid x_k) = |x_k|/4$, then the
term on the right of (3.1) is greater than or equal to
$(x_k^2/25)\,[(4|x_k|)^{-1} + E(Z_{2k}^2 - 2Z_{2k}x_k)]$. For small $|x_k|$,
the bracketed term is large. Thus, we do not expect convergence.
In fact, letting $Z_{1k} \sim U\left(-\sqrt{3|x_k|/4},\ \sqrt{3|x_k|/4}\right)$
and $Z_{2k} \sim U(0, |x_k|)$, after 2500 iterations the smallest
value of $g(x)$ was .008; the sequence $x_k$ largely oscillated
between −.1 and .1. There appears to be a barrier to
convergence.

The  following  example  treats  the  question  of  conver¬ 
gence  barriers  more  generally. 

Example 2: Let $\varphi : [0, 1] \to [0, .5]$ be any continuous
function taking on the value 0 only at zero. Define
$s(x,t) = [2\varphi(t)]^{-1}\, 1_{[0,\,2\varphi(t)]}(x)$ as the transition kernel
of a Markov chain; i.e., with $X_0$ having an arbitrary
density on $[0, 1]$ and given the density $f_k(x)$ of $X_k$, the
density $f_{k+1}(x)$ of $X_{k+1}$ is given by

$$f_{k+1}(x) = \int_0^1 s(x,t)\, f_k(t)\, dt .$$

By direct calculation, $E(X_{k+1} \mid X_k = t) = \varphi(t)$. If
$\varphi'(0) < 1$, then by Th. 1, $X_n \to 0$ a.s. On the other hand,
if $\varphi'(0) > e/2 = 1.36$, then $X_n$ does not even converge
in probability to 0. In fact, we state without proof that
there exists a nontrivial distribution $F_0$, as close to the
constant 1 as we wish, such that if $F_{X_n}(x) \le F_0(x)$ for
every $x$, then the same is true of $F_{X_{n+1}}$. Thus, there exist
“stable barriers” to convergence.
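
The chain of Example 2 can be simulated directly, since the kernel says that $X_{k+1}$ is uniform on $[0, 2\varphi(X_k)]$. The sketch below assumes a hypothetical piecewise-linear choice $\varphi(t) = \min(at, 0.5)$, which is continuous, vanishes only at zero and has $\varphi'(0) = a$; the helper names are illustrative. For this linear choice near zero, $\log X_{k+1} = \log X_k + \log(2a) + \log U$ with $U$ uniform on $[0,1]$ and $E \log U = -1$, which is one way to see where a threshold of $e/2$ can come from.

```python
import random


def make_phi(slope):
    """Hypothetical phi(t) = min(slope * t, 0.5); maps [0, 1] into [0, 0.5]."""
    return lambda t: min(slope * t, 0.5)


def chain_path(phi, x0, steps, seed=0):
    """Simulate X_{k+1} ~ Uniform[0, 2*phi(X_k)], so E(X_{k+1} | X_k = t) = phi(t)."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        path.append(rng.uniform(0.0, 2.0 * phi(path[-1])))
    return path
```

Comparing `chain_path(make_phi(0.4), 1.0, 200)` with `chain_path(make_phi(1.5), 1.0, 200)` gives a feel for the two regimes described above, although a single path proves nothing about either.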

With respect to the rate of convergence, using the
notion of q-linear convergence (Dennis and Schnabel
(1983)) we say that we have “expected q-linear convergence”
if condition (b) of Th. 1 is satisfied. In this sense,
Th. 1 gives us both the “expected rate of convergence”
and the certainty of it.

4.  ANIMAL  SEARCH.  Investigators  have  observed 
that  animal  search  often  appears  random.  Their  reports 
suggest  thereby  that  the  search  is  not  purposefully  di¬ 
rected.  As  an  example,  Wilson  (1963)  observes  that  the 
male  gypsy  moth  detecting  the  “faintly  tinted  air”  pro¬ 
duced  by  the  female  (perhaps  thousands  of  meters  dis¬ 
tant)  cannot  fly  in  the  direction  of  increasing  scent  be¬ 
cause  the  “attractant  is  distributed  almost  uniformly  af¬ 
ter  it  has  drifted  a  few  meters  from  the  female.”  Wilson 
then  describes  the  path  of  the  moth:  “. . .  they  simply  fly 
upwind  and  thus  inevitably  move  toward  the  female.  If 
by  accident  they  pass  out  of  the  active  zone,  they  either 
abandon  the  search  or  fly  about  at  random  until  they 
pick  up  the  scent  again.  Eventually  as  they  approach 
the  female,  there  is  a  slight  increase  in  the  concentra¬ 
tion  of  the  chemical  attractant  and  this  can  serve  as  a 


guide  for  the  remaining  distance.”  The  random  flying 
about  depicted  by  Wilson  does  not  suggest  purposefully 
directed  behavior.  Some  consideration  of  the  problem 
from  the  standpoint  of  the  moth  makes  clear  that  this 
random  flying  about  is  an  integral  part  of  the  solution. 
In  fact,  it  is  not  correct  that  simply  flying  upwind  the 
moth  “inevitably  moves  toward  the  female.”  As  Wilson 
shows  in  a  figure  on  page  103,  the  wind  forms  a  plume 
from  the  gas  the  female  emits;  unless  the  moth  happens 
to  happily  be  flying  along  the  “line  of  sight”  to  the  fe¬ 
male,  flying  upwind  must  inevitably  bring  him  to  the 
edge  of  the  plume.  At  this  point  the  moth  must  use 
derivative  information  (decrease  in  intensity)  to  return 
to  the  plume,  not  totally  along  the  downwind  direction 
but  with  some  motion  along  the  line  perpendicular  to 
the  wind’s  path.  The  search  for  the  plume  cannot  be 
“totally  random”  for  this  would  suggest  that  the  moth’s 
motion  could  be  modelled  by  a  “random  walk”  model  in 
two  dimensions.  Such  a  model  results  in  infinite  expected 
time  to  return  to  the  plume.  Hence,  some  deterministic 
component  depending  on  g{x)  (a  function  inversely  pro¬ 
portional  to  intensity)  and  g'(x)  must  be  included  in  the 
“random”  algorithm  dictating  the  moth’s  motion.  These 
observations  are  supported  by  Wilson  who  notes  that  in¬ 
creasing  scent  is  a  “guide”  to  the  moth.  This  algorithm 
must  be  efficient  in  order  that  the  moth  has  a  chance  for 
success  before  it  exhausts  itself. 

Another  type  of  animal  search  is  found  in  the  in¬ 
visible  odor  trails  fire  ant  workers  leave  to  guide  their 
colleagues  to  a  food  source.  The  trail  consists  of  a 
pheromone  laid  down  by  workers  returning  to  their  nest 
after  finding  a  source  of  food.  The  signal  consists  of  in¬ 
termittent  “hot”  spots  which  decrease  in  density  as  the 
distance  from  the  source  increases.  Again,  one  observes 
a  “random”  motion  of  the  ant  as  it  hits  one  “hot”  spot 
and  searches  for  the  next.  Again,  the  path  of  the  ant 
cannot be patterned after a random walk. The function
$g(x)$ must be inversely proportional to a cumulative sum
of “hot spots.”

What  can  go  wrong?  As  the  theory  of  the  last  sec¬ 
tion  suggests,  it  is  possible  for  an  algorithm  not  “finely 
tuned”  to  produce  erratic  behavior  even  near  the  source. 
Anyone who has watched an exhausted retriever try to
find a source (such as a familiar tennis ball buried in
deep grass) will observe that the retriever at times simply
steps over the source without ever focusing upon it.
It  appears  that  exhaustion  has  distorted  the  algorithm 
into  producing  a  “stable  barrier”  to  convergence. 

5. FINAL COMMENTS: We have touched here only
upon the connection between animal search and randomized
algorithms. We have also investigated a number of
models, such as “hot spots” distributed over lattices, and
run a number of simulations which appear to demonstrate
the utility of RNR-type “purposeful random models”
in describing the convergence (or lack of it) of animals
to a target.

We  have  shown  that  a  random  element  is  vital  to  the 
success  of  searches  based  on  low  intensity  or  intermittent 
signals.  It  would  be  interesting  to  investigate  the  bio¬ 
logical  mechanisms  which  produce  and  regulate  random 
elements.  In  the  case  of  higher  mammals,  “anxiety”  ap¬ 
pears  to  be  such  a  mechanism. 

One  area  which  requires  additional  research  is  that 
in  which  the  signal  is  contaminated  by  additive  noise  in 
such  a  way  that  local  minima  and  maxima  are  produced 
thereby  violating  the  assumption  of  a  unique  maximum. 
It appears that the external noise can be added in the
algorithm to the “noise” $Z_{1k}$ produced by the animal (or
algorithm). Near the source, external noise should have
little effect while far from the source, it may be necessary
to “design” $Z_{1k}$ so that it dominates the random process.

Another  area  which  demands  further  investigation 
because  it  appears  central  to  gaining  insight  into  the  “al¬ 
gorithmic”  mechanism  employed  in  animal  search  is  the 
speed  of  convergence.  Efforts  are  now  being  made  to  find 
the  expected  number  and  variance  of  the  number  of  steps 
to solution under general conditions.

Abstract

We  present  and  compare  learning  rate  schedules  for 
stochastic  gradient  descent,  a  general  algorithm  which 
includes  LMS,  on-line  backpropagation  and  k-means 
clustering  as  special  cases.  We  introduce  “search-then- 
converge”  type  schedules  which  outperform  the  classical 
constant  and  “running  average”  (1/t)  schedules  both  in 
speed  of  convergence  and  quality  of  solution. 

Introduction:  Stochastic  Gradient  De¬ 
scent 

The optimization task is to find a parameter vector $W$
which minimizes a function $G(W)$. In the context of
learning systems typically $G(W) = \mathcal{E}_X E(W, X)$, i.e. $G$ is
the average of an objective function over the exemplars,
labeled $E$ and $X$ respectively. The stochastic gradient
descent algorithm is

$$\Delta W(t) = -\eta(t)\,\nabla_W E(W(t), X(t)),$$

where $t$ is the “time”, and $X(t)$ is the most recent
independently-chosen random exemplar. For comparison,
the deterministic gradient descent algorithm is

$$\Delta W(t) = -\eta(t)\,\nabla_W \mathcal{E}_X E(W(t), X).$$

While on average the stochastic step is equal to the
deterministic step, for any particular exemplar $X(t)$ the
stochastic step may be in any direction, even uphill in
$\mathcal{E}_X E(W(t), X)$. Despite its noisiness, the stochastic
algorithm may be preferable when the exemplar set is
large, making the average over exemplars expensive to
compute.

The issue addressed by this paper is: which function
should one choose for $\eta(t)$ (the learning rate schedule)
in order to obtain fast convergence to a good local
minimum? The schedules compared in this paper are the
following (Fig. 1):

•  Constant: $\eta(t) = \eta_0$

•  “Running Average”: $\eta(t) = \eta_0/(1 + t)$

•  Search-Then-Converge: $\eta(t) = \eta_0/(1 + t/\tau)$

Figure 1: Comparison of the shapes of the schedules.
Dashed line = constant. Solid line = search-then-converge.
Dotted line = “running-average”.


“Search-then-converge” is the name of a novel class
of schedules which we introduce in this paper. The specific
equation above is merely one member of this class
and was chosen for comparison because it is the simplest
member of that class. We find that the new schedules
typically  outperform  the  classical  constant  and  running 
average  schedules.  Furthermore  the  new  schedules  are 
capable  of  attaining  the  optimal  asymptotic  convergence 
rate  for  any  objective  function  and  exemplar  distribu¬ 
tion.  The  classical  schedules  cannot. 
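
The three schedules compared in the paper are one-line functions. The sketch below only restates the formulas; the function and parameter names (`eta0`, `tau`) are chosen here for illustration.

```python
def constant(t, eta0):
    """Constant schedule: the learning rate never changes."""
    return eta0


def running_average(t, eta0=1.0):
    """Classical 1/t-type schedule: eta0 / (1 + t)."""
    return eta0 / (1.0 + t)


def search_then_converge(t, eta0, tau):
    """eta0 / (1 + t / tau): roughly eta0 while t << tau, roughly tau*eta0/t for t >> tau."""
    return eta0 / (1.0 + t / tau)
```

With `tau = 1` the last function coincides with `running_average`, so the search time is the only extra knob.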

Adaptive  schedules  are  beyond  the  scope  of  this  short 
paper  (see  however  Darken  and  Moody,  1991).  Nonethe¬ 
less,  all  of  the  adaptive  schedules  in  the  literature  of 
which  we  are  aware  are  either  second  order,  and  thus 
too expensive to compute for large numbers of parameters,
or make no claim to asymptotic optimality.

Example Task: K-Means Clustering

As  our  sample  gradient-descent  task  we  choose  a  k-means 
clustering  problem.  Clustering  is  a  good  sample  problem 
to  study,  both  for  its  inherent  usefulness  and  its  illustra¬ 
tive  qualities.  Under  the  name  of  vector-quantization, 
clustering  is  an  important  technique  for  signal  compres¬ 
sion  in  communications  engineering.  In  the  machine 
learning  field,  clustering  has  been  used  as  a  front-end  for 
function  learning  and  speech  recognition  systems.  Clus¬ 
tering  also  has  many  features  to  recommend  it  as  an 
illustrative  stochastic  optimization  problem.  The  adap¬ 
tive  law  is  very  simple,  and  there  are  often  many  local 
minima  even  for  small  problems.  Most  significantly  how¬ 
ever,  if  the  means  live  in  a  low  dimensional  space,  visu¬ 
alization  of  the  parameter  vector  is  simple:  it  has  the 
interpretation  of  being  a  set  of  low-dimensional  points 
which  can  be  easily  plotted  and  understood. 

The k-means task is to locate $k$ points (called
“means”) to minimize the expected squared distance between
a new random exemplar and the nearest mean to that
exemplar. Thus, the function being minimized in k-means
is $\mathcal{E}_X \|X - M_{nrst}\|^2$, where $M_{nrst}$ is the nearest
mean to exemplar $X$. An equivalent form is
$\int dX\, P(X) \sum_{\alpha=1}^{k} I_\alpha(X)\, \|X - M_\alpha\|^2$, where $P(X)$ is the
density of the exemplar distribution and $I_\alpha(X)$ is the
indicator function of the Voronoi region corresponding
to the $\alpha$th mean. The stochastic gradient descent
algorithm for this function is

$$\Delta M_{nrst}(t) = \eta(t_{nrst})\,[X(t) - M_{nrst}(t)],$$

i.e. the nearest mean to the latest exemplar moves
directly towards the exemplar a fractional distance
$\eta(t_{nrst})$. In a slight generalization from the stochastic
gradient descent algorithm above, $t_{nrst}$ is the total number
of exemplars (including the current one) which have
been assigned to mean $M_{nrst}$.
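
The update just described (move only the nearest mean, by a fraction set by that mean's own exemplar count) can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the helper names, the list-based data layout and the random initialization in `run_kmeans` are assumptions.

```python
import random


def nearest_index(means, x):
    """Index of the mean closest to exemplar x (squared Euclidean distance)."""
    return min(range(len(means)),
               key=lambda a: sum((xi - mi) ** 2 for xi, mi in zip(x, means[a])))


def kmeans_step(means, counts, x, schedule):
    """Move the nearest mean a fraction schedule(t_nrst) of the way towards x."""
    a = nearest_index(means, x)
    counts[a] += 1                      # t_nrst includes the current exemplar
    eta = schedule(counts[a])
    means[a] = [mi + eta * (xi - mi) for xi, mi in zip(x, means[a])]
    return a


def run_kmeans(k, n_exemplars, schedule, seed=0):
    """On-line k-means on exemplars drawn uniformly from the unit square."""
    rng = random.Random(seed)
    means = [[rng.random(), rng.random()] for _ in range(k)]  # assumed initialization
    counts = [0] * k
    for _ in range(n_exemplars):
        kmeans_step(means, counts, [rng.random(), rng.random()], schedule)
    return means
```

For example, `run_kmeans(9, 100_000, lambda t: 1.0 / (1.0 + t / 32.0))` runs the 9-means task with a search-then-converge schedule; any of the schedules under comparison can be passed in the same way.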

As a specific example problem to compare various
schedules across, we take $k = 9$ (9 means) and $X$ uniformly
distributed over the unit square. Although this
would  appear  to  be  a  simple  problem,  it  has  several  ob¬ 
served  local  minima.  The  global  minimum  is  where  the 
means  are  located  at  the  centers  of  a  uniform  3x3  grid 
over  the  square.  Simulation  results  are  presented  in  fig¬ 
ures  2  and  3. 

Constant  Schedule 

A  constant  learning  rate  has  been  the  traditional  choice 
for  LMS  and  backpropagation.  However,  a  constant 
rate  generally  does  not  allow  the  parameter  vector  (the 
“means”  in  the  case  of  clustering)  to  converge.  Instead, 
the parameters hover around a minimum at an average
distance proportional to $\eta$ and to a variance which depends
on the objective function and the exemplar set.
Since  the  statistics  of  the  exemplars  are  generally  as¬ 
sumed  to  be  unknown,  this  residual  misadjustment  can¬ 
not  be  predicted.  The  resulting  degradation  of  other 
measures  of  system  performance,  mean  squared  classifi¬ 
cation  error  for  instance,  is  still  more  difficult  to  predict. 
Thus  the  study  of  how  to  make  the  parameters  converge 
is  of  significant  practical  interest. 

Current practice for backpropagation, when large
misadjustment is suspected, is to restart learning with a
smaller $\eta$. Shrinking $\eta$ does result in less residual
misadjustment, but at the same time the speed of convergence
drops. In our example clustering problem, a new
phenomenon appears as $\eta$ drops: metastable local minima.
Here the parameter vector hovers around a relatively
poor solution for a very long time before slowly
transiting to a better one.

Running  Average  Schedule 

The running average schedule ($\eta(t) = \eta_0/(1 + t)$) is the
staple of the stochastic approximation literature (Robbins
and Monro, 1951) and of k-means clustering (with
$\eta_0 = 1$) (MacQueen, 1967). This schedule is optimal for
$k = 1$ (1 mean), but performs very poorly for moderate
to large $k$ (like our example problem with 9 means).
From the example run (Fig. 2A), it is clear that $\eta$ must
decrease more slowly in order for a good solution to be
reached.  Still,  an  advantage  of  this  schedule  is  that  the 
parameter  vector  has  been  proven  to  converge  to  a  lo¬ 
cal  minimum  (MacQueen,  1967).  We  would  like  a  class 
of  schedules  which  is  guaranteed  to  converge,  and  yet 
converges  as  quickly  as  possible. 
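
The name “running average” can be checked directly: with $\eta_0 = 1$ and a single mean, the update keeps the sample mean of the exemplars seen so far, which is the point minimizing the summed squared distance to them. A minimal one-dimensional sketch, assuming time is counted from $t = 0$ and using a hypothetical helper name:

```python
def running_mean(xs):
    """Apply m <- m + eta(t) * (x - m) with eta(t) = 1 / (1 + t), t = 0, 1, 2, ..."""
    m = 0.0                      # starting value is irrelevant: eta(0) = 1 overwrites it
    for t, x in enumerate(xs):
        m += (x - m) / (1.0 + t)
    return m
```

After each step `m` equals the arithmetic mean of the values processed so far.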

Stochastic  Approximation  Theory 

In the stochastic approximation literature, which has
grown steadily since it began in 1951 with the Robbins
and Monro paper, we find conditions on the learning rate
to ensure convergence with optimal speed.¹

From (Ljung, 1977), we find that $\eta(t) \to A t^{-p}$
asymptotically, for any $1 > p > 0$, is sufficient to guarantee
convergence. Power law schedules may work quite
well in practice (Darken and Moody, 1990); however,
from (Goldstein, 1987) we find that in order to converge
at an optimal rate, we must have $\eta(t) \to c/t$ asymptotically,
for $c$ greater than some threshold which depends
on the objective function and exemplars.² When the
optimal convergence rate is achieved, $\|W - W^*\|^2$ goes
like $1/t$.

¹The cited theory generally does not directly apply to the full
nonlinear setting of interest in much practical work. For more
details on the relation of the theory to practical applications and
a complete quantitative theory of asymptotic misadjustment, see
(Darken and Moody, 1991).

The running average schedule goes as $\eta_0/t$ asymptotically.
Unfortunately, the convergence rate of the running
average schedule often cannot be improved by enlarging
$\eta_0$, because the resulting instability for small $t$ can
outweigh the improvements in asymptotic convergence rate.

Search-Then-Converge  Schedules 

We now introduce a new class of schedules which are
guaranteed to converge and, furthermore, can achieve the
optimal $1/t$ convergence rate without stability problems.
These schedules are characterized by the following features.
The learning rate stays high for a “search time”
$\tau$ in which it is hoped that the parameters will find and
hover about a good minimum. Then, for times greater
than $\tau$, the learning rate decreases as $c/t$, and the
parameters converge.

We choose the simplest of this class of schedules for
study, the “short-term linear” schedule ($\eta(t) = \eta_0/(1 + t/\tau)$),
so called because the learning rate decreases linearly
during the search phase. This schedule has $c = \tau\eta_0$
and reduces to the running average schedule for $\tau = 1$.
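
A short worked expansion, added only as a reading aid and using nothing beyond the definition of the short-term linear schedule, makes the two regimes explicit:

```latex
\eta(t) = \frac{\eta_0}{1 + t/\tau} = \frac{\tau\eta_0}{\tau + t},
\qquad
\eta(t) \approx \eta_0\left(1 - \frac{t}{\tau}\right) \quad (t \ll \tau),
\qquad
\eta(t) \approx \frac{\tau\eta_0}{t} = \frac{c}{t} \quad (t \gg \tau),
\qquad c = \tau\eta_0 .
```

The first approximation is the linear decrease during the search phase, the second is the $c/t$ tail, and setting $\tau = 1$ gives $\eta_0/(1 + t)$, the running average schedule.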

Conclusions 

We have introduced the new class of “search-then-converge”
learning rate schedules. Stochastic approximation
theory indicates that for large enough $\tau$, these
schedules can achieve optimally fast asymptotic convergence
for any exemplar distribution and objective
function. Neither constant nor “running average” (1/t)
schedules can achieve this. Empirical measurements on
k-means clustering tasks are consistent with this expectation.
Furthermore, asymptotic conditions obtain surprisingly
quickly. Additionally, the search-then-converge
schedule improves the observed likelihood of escaping
bad local minima.

As  implied  above,  k-means  clustering  is  merely  one 
example  of  a  stochastic  gradient  descent  algorithm.  LMS 
and  on-line  backpropagation  are  others  of  great  interest 
to  the  learning  systems  community.  Due  to  space  limi¬ 
tations,  experiments  in  these  settings  will  be  published 
elsewhere  (Darken  and  Moody,  1991).  Preliminary  ex¬ 
periments  seem  to  confirm  the  generality  of  the  above 
conclusions. 

Extensions to this work in progress include application
to algorithms more sophisticated than simple gradient
descent, and adaptive search-then-converge algorithms
which automatically determine the search time.

²This choice of asymptotic $\eta$ satisfies the necessary conditions
given in (White, 1989).

Figure 2: Example runs with classical schedules on 9-means clustering task. Exemplars
are uniformly distributed over the square. Dots indicate previous locations of the means.
The triangles (barely visible) are the final locations of the means. (A) “Running average”
schedule ($\eta = 1/(1 + t)$), 100k exemplars. Means are far from any minimum and
progressing very slowly. (B) Large constant schedule ($\eta = 0.1$), 100k exemplars. Means hover
around global minimum at large average distance. (C) Small constant schedule ($\eta = 0.01$),
50k exemplars. Means stuck in metastable local minimum. (D) Small constant schedule
($\eta = 0.01$), 100k exemplars (later in the run pictured in C). Means tunnel out of local
minimum and hover around global minimum.

Abstract

Exploratory  Projection  Pursuit  is  a  technique  for 
forming  projections  of  a  multivariate  point  cloud 
and  searching  for  those  projections  that  reveal  the 
most  structure.  The  search  component  is  typically 
some  variant  of  a  steepest  descent  procedure  and, 
particularly  when  the  search  space  is  ill-behaved, 
leaves  open  the  possibility  that  the  best  projection 
will not be found. Genetic Algorithms are generally
applicable optimization techniques, well suited for
search  spaces  in  which  more  traditional  techniques 
fail.  This  paper  describes  experiments  designed  to 
ascertain  the  effectiveness  of  Genetic  Algorithms 
as  optimizers  for  exploratory  projection  pursuit. 

1  Introduction 

A  common  goal  of  exploratory  data  analysis  is  to  find  struc¬ 
ture  (clusters,  hyperplanes,  and  the  like)  among  a  configu¬ 
ration  of  points  in  p-dimensional  space.  This  is  a  difficult 
task  for  large  p  because  high-dimensional  space  is  inherently 
empty,  and  procedures  that  rely  on  interpoint  distances  to  es¬ 
tablish  structure  fall  prey  to  the  “curse  of  dimensionality”.  A 
typical  approach  to  the  problem  is  to  reduce  dimensionality 
with  the  hope  that,  in  doing  so,  information  loss  is  minimal. 
Exploratory Projection Pursuit (PP) [4, 5] is a dimension
reduction technique for forming projections of a multivariate
point cloud onto subspaces spanned (usually) by the first 1, 2,
or 3 coordinates, and searching for those projections that
reveal the most structure. The search component of PP systems
is typically some variant of a steepest descent procedure and,
particularly when the search space is not well behaved, leaves
open the possibility that the best projection will not be found.
Genetic Algorithms (GAs) [6] hold a great deal of promise as
generally applicable optimization techniques [1], and are
particularly suitable for search spaces in which more traditional
techniques fail. This paper provides a brief overview of both
PP and GAs, and describes an implementation of PP in which
a GA is used to locate the most interesting projections. The
power of the GA approach is illustrated by examples in which
the genetic PP algorithm is applied to datasets generated by
the infamous RANDU pseudo-random number generator.

2  Exploratory  Projection  Pursuit 

When applying PP, the analyst’s goal is simply to locate
interesting structure within the high-dimensional data space. The
basic paradigm for PP is:

Assume the data is unstructured in p-space
REPEAT:

1.  locate and save directions indicating the presence
of structure

2.  return to the unstructured assumption by removing
any structure found in step 1.

UNTIL:  no significant structure can be found

For  the  purposes  of  this  paper,  the  critical  issues  are  related 
to  Step  1,  namely,  how  the  computer  can  recognize  when  a 
projection  is  “interesting”,  and  how  such  projections  can  be 
located. 

2.1  Evaluating  a  Projection 

Techniques  for  defining  evaluation  functions  that  can  measure 
the  degree  to  which  a  given  projection  reveals  structure  are 
described  in  detail  in  [7].  Due  to  the  ease  with  which  it  can  be 
programmed,  the  evaluation  function  used  in  the  experiments 
in  this  paper  is  the  simple  “clottedness”  index  described  by 
Friedman  and  Tukey  in  [5]. 

Figure 3: Comparison of 10 runs over the various schedules on the 9-means clustering
task (as described under Fig. 1). The exemplars are the same for each schedule.
Misadjustment is defined as $\|W - W^*\|^2$. (A) Small constant schedule ($\eta = 0.01$).
Note the well-defined transitions out of metastable local minima and large
misadjustment late in the runs. (B) “Running average” schedule ($\eta = 1/(1 + t)$). 6
out of 10 runs stick in a local minimum. The others slowly head for the global
minimum. (C) Search-then-converge schedule ($\eta = 1/(1 + t/4)$). All but one run
head for global minimum, but at a suboptimal rate (asymptotic slope less than -1).
(D) Search-then-converge schedule ($\eta = 1/(1 + t/32)$). All runs head for global
minimum at optimally quick rate (asymptotic slope of -1).

The clottedness index was designed to locate projections that
simultaneously maximize both the overall “spread” and the
local density of the datapoints. In order to keep the notation
simple, an index designed to assess the clottedness of
a one-dimensional projection direction $\alpha$ is described here.
Friedman and Tukey defined the clottedness of $\alpha$ as:

$$C(\alpha) = s(\alpha)\,d(\alpha) \qquad (1)$$

Here, $s(\alpha)$ is a measure of the overall variability of the data
as projected onto direction $\alpha$, and is computed as the trimmed
standard deviation of the $N$ data points, as projected onto $\alpha$.
Local point density is defined as

$$d(\alpha) = \sum_{i=1}^{N} \sum_{j=1}^{N} f(r_{ij})\, I(R - r_{ij}) \qquad (2)$$

In (2), $r_{ij}$ is a measure of the absolute distance between any
pair of points as projected onto $\alpha$, and $f(r_{ij})$ is a kernel
function that monotonically decreases for increasing $r$. A local
cutoff radius, $R$, defines the neighborhood within which point
density is measured, and an indicator function, $I(\eta)$, that
evaluates to unity for $\eta > 0$ and to zero otherwise, is used to
identify those pairs of points no farther apart than $R$. In words,
then, the average nearness of the points along $\alpha$ is computed as
the sum of the contributions of all pairs of points no farther
apart than $R$, such that the closer the points, the greater their
contribution to the double sum defined in (2). Locating a direction,
$\alpha$, that maximizes (1) therefore amounts to locating a direction
that shows a configuration of well separated, dense clusters,
that is, an “interesting” projection. The extension of (1) and (2)
to two-dimensional projections is straightforward. Data variability
across the plane defined by $(\alpha, \beta)$ is defined simply as
$s(\alpha)s(\beta)$ and point density is measured just as in (2) with $r_{ij}$
defined as the Euclidean distance between pairs of points on
the projection plane.
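
A one-dimensional version of the index in (1) and (2) can be sketched in a few lines. The text does not give the kernel or the trimming fraction, so the choices below are assumptions made for illustration: the kernel is taken as $f(r) = R - r$, a fraction `trim` of the points is dropped from each tail before computing the standard deviation, and all function names are hypothetical.

```python
import math


def trimmed_std(values, trim=0.1):
    """Standard deviation after dropping a fraction `trim` of points from each tail."""
    v = sorted(values)
    cut = int(trim * len(v))
    kept = v[cut:len(v) - cut] if cut > 0 else v
    mean = sum(kept) / len(kept)
    return math.sqrt(sum((x - mean) ** 2 for x in kept) / len(kept))


def local_density(values, radius):
    """Double sum over all ordered pairs (i, j) of f(r_ij) * I(R - r_ij)."""
    total = 0.0
    for xi in values:
        for xj in values:
            r = abs(xi - xj)
            if radius - r > 0:        # indicator: pair counts only when r_ij < R
                total += radius - r   # assumed kernel f(r) = R - r, decreasing in r
    return total


def clottedness(data, direction, radius, trim=0.1):
    """C = s * d for the projection of each data row onto `direction`."""
    proj = [sum(a * x for a, x in zip(direction, row)) for row in data]
    return trimmed_std(proj, trim) * local_density(proj, radius)
```

On four points forming two tight, well separated pairs along the first axis, such as `[[0, 0], [0, 1], [3, 0], [3, 1]]`, the index with `radius=1.0` is larger for direction `(1, 0)` than for `(0, 1)`, which is the behavior the index is meant to reward. The double loop costs $O(N^2)$ per evaluation, which matters once many projections are scored.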

2.2  Searching  for  Interesting  Projections 

Given that any arbitrary projection can be evaluated according
to its degree of interest, a mechanism must be found to locate
interesting projections in the p-dimensional data space. If one
imagines a two-dimensional grid encompassing all possible
two-dimensional projections, then the values obtained from
the clottedness function define a third dimension that is a
surface over the grid of possible projections. In this context, the
search for interesting projections amounts to a search for local
maxima along this surface. A standard approach to problems
of this sort involves choosing an initial starting point, choosing
a “step” size and then varying $\alpha$ and $\beta$ by steps until a
new point, “uphill” of the previous point, is located. The
application of numerical optimization procedures of this type
to PP is described in some detail in [4] and [5].

2.3  Summary

PP is an effective approach for uncovering structure in
multivariate data. The analyst need not specify a model in advance,
estimation takes place in a low-dimensional context (thus
avoiding the “curse of dimensionality”), projections that reveal
structure can be cheaply applied to new data, and multiple
informative projections can often be found. Unfortunately,
the projections located by PP can often be difficult to interpret
[8]. In addition, the numerical optimization procedures might
locate spurious structure [3] or fail to locate real structure.
The latter possibility is the subject of this paper.


3  Motivation 

Data obtained from IBM’s now infamous RANDU pseudo-random
number generator are often described as the kind of
data to which PP techniques might be applied. The RANDU
generator has the property that any three consecutively generated
numbers satisfy $x_{n+2} - 6x_{n+1} + 9x_n \equiv 0 \pmod 1$, and
so the triplets lie on 15 parallel planes through the unit cube.
These planes are, however, visible only over a narrow “squint
angle” (less than 5°) and it has been suggested that PP methods
could be used to locate two-dimensional projections that
reveal the planes. However, Buja and Stuetzle¹ state that:

“The RANDU planes do suggest several questions
about PP. First, it seems doubtful that any
version of PP would pick up the planes...even if the
sample  estimate  of  the  projection  index  had  min¬ 
ima  at  projections  which  show  the  RANDU  planes, 
the  valleys  might  be  much  too  narrow  to  be  found 
by  conventional  optimizers...in  spite  of  being  pro¬ 
moters  of  PP  methods  ourselves,  we  are  not  quite 
convinced  that  this  example  makes  a  strong  case 
for  PP.  On  the  opposite,  it  might  highlight  some 
unresolved  problems.”  (italics  mine) 

This  statement  prompted  the  work  described  in  this  paper — 
an  investigation  designed  to  ascertain  whether  a  decidedly 
unconventional  optimizer  (a  Genetic  Algorithm)  could  be  ap¬ 
plied  in  a  PP  setting  in  order  to  locate  two-dimensional  pro¬ 
jections  that  reveal  the  RANDU  planes. 
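
The RANDU relation quoted above is easy to verify numerically. The sketch below uses the standard RANDU recurrence $v_{j+1} = 65539\, v_j \bmod 2^{31}$ with an odd seed; the function names are chosen here for illustration. Because $65539^2 \equiv 6 \cdot 65539 - 9 \pmod{2^{31}}$, the combination $x_{n+2} - 6x_{n+1} + 9x_n$ is always an integer, and with every $x$ in $(0, 1)$ that integer can only range from −5 to 9, which gives the 15 planes.

```python
def randu(seed, n):
    """Return n RANDU outputs in (0, 1): v <- 65539 * v mod 2**31, x = v / 2**31."""
    v = seed  # should be odd
    out = []
    for _ in range(n):
        v = (65539 * v) % 2**31
        out.append(v / 2**31)
    return out


def plane_indices(xs):
    """Integer k = x[n+2] - 6*x[n+1] + 9*x[n] for every consecutive triplet."""
    return [c - 6 * b + 9 * a for a, b, c in zip(xs, xs[1:], xs[2:])]
```

Collecting `set(plane_indices(randu(1, 100_000)))` shows which of the possible planes a given sample actually occupies.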

4  Genetic  Algorithms 

GAs  typically  follow  a  standard  paradigm: 

•  define  an  encoding  scheme, 

•  generate  a  starting  population, 

•  evaluate  the  starting  population, 

•  reproduce,  recombine,  mutate  and  re-evaluate  until  some 
termination  criterion  is  met 

In  the  context  of  PP,  the  application  of  GAs  to  the  search  for 
interesting  projections  involves  a  straightforward  implemen¬ 
tation  of  this  paradigm.  Each  step  is  described  below. 

4.1  Define  an  Encoding  Scheme 

Note that a projection is usually represented algebraically as
a pair² of linear combinations of the original p-dimensional
data. For example, a starting projection might be composed
of $\alpha = .32x_1 + .45x_2 - \ldots + .12x_p$ and
$\beta = .56x_1 - .77x_2 + \ldots + .78x_p$. We transform each of the
$2p$ parameters into a binary string³ of sufficient length to
represent the desired range⁴. Concatenating the $2p$ bit-strings
then delivers a single bit string representing the projection.

¹On p. 486 of a discussion of Huber’s Projection Pursuit review
paper [2].

²To simplify things, we’ll assume we are projecting onto the
plane, so that our solutions will always be 2D scatterplots.
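The encoding step can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the 11-bit resolution used as an example in the footnotes (a step of 0.001, so values run from -1.024 to 1.023), and it uses the standard binary-reflected Gray code as a stand-in for the grey scale encoding mentioned there. All function names are hypothetical.

```python
BITS = 11                    # bits per parameter (assumed, from the 2^11 = 2048 example)
STEP = 0.001                 # assumed resolution of each parameter
OFFSET = 1 << (BITS - 1)     # 1024, shifts signed levels to 0..2047


def to_gray(n):
    """Binary-reflected Gray code: neighbouring integers differ in one bit."""
    return n ^ (n >> 1)


def from_gray(g):
    """Invert to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def encode_param(x):
    """Quantize one coefficient and return its Gray-coded bit string."""
    level = round(x / STEP) + OFFSET
    level = max(0, min((1 << BITS) - 1, level))   # clamp to the representable range
    return format(to_gray(level), f"0{BITS}b")


def decode_param(bits):
    """Recover the quantized coefficient from its bit string."""
    return (from_gray(int(bits, 2)) - OFFSET) * STEP


def encode_projection(alpha, beta):
    """Concatenate the 2p parameter strings into one chromosome."""
    return "".join(encode_param(x) for x in list(alpha) + list(beta))


def decode_projection(bitstring, p):
    """Split a chromosome back into the two direction vectors."""
    params = [decode_param(bitstring[i:i + BITS])
              for i in range(0, 2 * p * BITS, BITS)]
    return params[:p], params[p:]


# Example with p = 3 (coefficients chosen only for illustration).
chromosome = encode_projection([0.32, 0.45, -0.12], [0.56, -0.77, 0.78])
alpha, beta = decode_projection(chromosome, 3)
```

With a Gray code, a single mutated bit usually moves a coefficient by a small amount, which is the property the encoding is meant to provide.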

4.2  Generate  a  Starting  Population 

Instead of starting with a single projection and searching uphill
from there, we start with many (hundreds, even thousands)
randomly⁵ selected projections. Each projection is then transformed
into the bit-string representation defined above.

4.3 Evaluate the Starting Population

Associated  with  each  bit  string  is  an  index  of  merit  measuring 
how  “interesting”  that  projection  is.  For  our  experiments,  the 
index  of  merit  is  the  clottedness  function  shown  as  equation 
(1).  The  index  of  merit  is  applied  to  each  bit-string  in  the 
starting  population. 

4.4  Iterate 

GAs make use of a biological metaphor in which we imagine
each bit string to be a chromosome capable of combining with
another chromosome and producing offspring that share the
characteristics of each parent. The following five steps are
repeated until some termination criterion is met:

Selection:  Just  as  in  biological  evolution,  natural  selection  is 
the  guiding  force  towards  adaptation.  Reproduction  is  con¬ 
trolled  by  a  biased  “roulette  wheel”  in  that  the  probability  that 
a  bit  string  will  be  allowed  to  provide  a  copy  of  itself  to  the 
next  generation  is  proportional  to  its  index  of  merit — the  best 
projections  provide  multiple  copies  of  their  “genetic  material” 
to  the  next  generation,  whereas  the  worst  projections  do  not 
survive  to  the  next  generation  at  all. 

Recombination: The biological metaphor is followed once
again, as the collection of bit-strings form into pairs, and each
member of some proportion⁶ of the pairs exchanges⁷ a sequence
of bits with the other member. This process is called
crossover and mimics the exchange of genetic material between
biological chromosomes.

Mutation:  To  ensure  that  genetic  diversity  is  maintained  (i.e. 
premature  convergence  on  local  maxima  is  avoided),  single 
bits spontaneously change state at a predefined mutation rate.

Restructuring: Selection, recombination and mutation serve
to  generate  a  brand  new  population  of  projections.  However, 
unlike  the  starting  population,  the  pairs  of  directions  that  form 
each  projection  are  unlikely  to  be  orthogonal.  In  this  step,  a 
Gram-Schmidt  orthogonalization  is  applied  to  each  projec¬ 
tion  to  ensure  that  the  orthogonality  constraint  is  maintained. 

³A special transformation called grey scale encoding is used to
ensure that bit strings representing close numbers are similar.

⁴For example, $2^{11} = 2048$, so an 11-bit string can be used to
represent parameters in the range $(-1.024, 1.024)$.

⁵Some carefully chosen projections (e.g., principal component
directions) can be included in the starting population if desired.

⁶This proportion is called the crossover rate.

⁷An exchange point is selected at random.


Figure 1: Results for 3D RANDU data.

Re-Evaluation: Each projection is evaluated and assigned an
index of merit.
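One pass through these steps can be sketched as below. This is a hedged illustration rather than the code used for the experiments: the clottedness index is not reproduced here, so a toy fitness (the number of 1 bits) stands in for it, and the crossover rate of 0.6 and mutation rate of 0.001 are assumed values. The `gram_schmidt` helper shows the restructuring step on a decoded pair of directions; it also normalizes the vectors, which goes slightly beyond plain orthogonalization.

```python
import math
import random


def roulette_select(population, scores, rng):
    """Biased roulette wheel: draw with probability proportional to score.

    Scores must be non-negative and not all zero.
    """
    return rng.choices(population, weights=scores, k=len(population))


def crossover(a, b, rng):
    """Exchange the tails of two bit strings at a random exchange point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if c == "0" else "0") if rng.random() < rate else c for c in bits
    )


def gram_schmidt(alpha, beta):
    """Return an orthonormal pair spanning the same plane as (alpha, beta)."""
    norm_a = math.sqrt(sum(x * x for x in alpha))
    a = [x / norm_a for x in alpha]
    dot = sum(x * y for x, y in zip(a, beta))
    b = [y - dot * x for x, y in zip(a, beta)]
    norm_b = math.sqrt(sum(y * y for y in b))
    return a, [y / norm_b for y in b]


def next_generation(population, fitness, crossover_rate, mutation_rate, rng):
    """Selection, recombination and mutation on a list of bit strings."""
    scores = [fitness(c) for c in population]
    pool = roulette_select(population, scores, rng)
    rng.shuffle(pool)
    children = []
    for i in range(0, len(pool) - 1, 2):
        a, b = pool[i], pool[i + 1]
        if rng.random() < crossover_rate:
            a, b = crossover(a, b, rng)
        children += [a, b]
    if len(pool) % 2:
        children.append(pool[-1])      # odd one out is copied unchanged
    return [mutate(c, mutation_rate, rng) for c in children]


# Toy run: 100 random 22-bit chromosomes, fitness = number of 1 bits.
rng = random.Random(0)
population = ["".join(rng.choice("01") for _ in range(22)) for _ in range(100)]


def count_ones(bits):
    return bits.count("1")


for _ in range(30):
    population = next_generation(population, count_ones, 0.6, 0.001, rng)
best = max(population, key=count_ones)
```

In a projection pursuit setting, each chromosome would be decoded into its two direction vectors, passed through `gram_schmidt`, and scored with the projection index before the next selection step.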

4.5  Summary 

When used for numerical optimization, GAs differ from more
traditional procedures in that they are stochastic in nature,
they use an encoding of the parameter set rather than the
parameters themselves, they start with a collection of points
rather than a single point, and they use a simple evaluation
procedure rather than computed or approximated derivatives.
Because of these characteristics, GAs tend to be extremely
simple to use and generate multiple, parallel search paths
that tend to locate global maxima in ill-behaved (multimodal,
discontinuous,  noisy)  search  spaces  where  more  traditional 
approaches  fail  [1].  An  additional  advantage  of  the  genetic 
approach  is  that  GAs  can  be  readily  implemented  on  fast 
parallel  hardware  when  large  problems  must  be  tackled  [10]. 

5  Results 

For the first experiment, a dataset consisting of N = 500 cases
was generated. Each case consisted of three consecutive random
variates generated by the RANDU generator. Figure 1
shows a plot with generation number on the x-axis and computed
clottedness index on the y-axis. The plot illustrates how
the index of merit associated with the current best projection
changes with time. The best projections at selected points,
as well as the settings for the GA parameters⁸, are also illustrated.
Note that, for 3D data, a projection clearly revealing
the RANDU planes was located after only 18 generations⁹.

Figure 2: Results for 4D RANDU data.

For the second experiment, 500 cases of 4D RANDU variates
were generated. In addition, the size of the starting population
was increased from 100 to 200. Figure 2 illustrates that a
good  projection  was  located  after  200  generations,  and  that 
subsequent  generations  were  only  slightly  better. 

When 5D RANDU data was generated, the size of the starting
population was increased to 500 and the mutation rate was
increased to 0.1. With these settings, the genetic PP algorithm
was able to locate a good projection after 500 generations.
When 6D data was generated, however, PP was unable to
locate a good projection even after 10,000 trials¹⁰.


⁸Guidelines for good parameter settings for optimization problems
are found in [9] and were used for these experiments.

⁹This took four minutes of CPU time on a 68040 NeXT workstation,
ten seconds when N was reduced from 500 to 100.

¹⁰The search space is so enormous (for a squint angle of 10°,
there are 10* 2D projections!) and the number of interesting views
so small that it is hard to imagine any optimizer doing well here.


6  Summary 

The experiments described in this paper clearly demonstrate
that a genetic version of PP can readily locate projections that
reveal the RANDU planes, even in the exceptionally difficult
4D and 5D spaces.

GAs seem well-suited for PP not only for their good performance
as general purpose optimizers, but also because they
are able to generate a collection of interesting projections instead
of the one (hopefully) optimal projection returned by
the currently used search techniques. Current implementations
of PP get around this problem by transforming the data
to remove found structure, searching for additional structure
in the transformed space, and repeating until no more structure
can be found [4]. This iterative approach is not necessary
when using GAs for search. Finally, when fast, parallel implementations
of GAs are available, a genetic version of PP
can readily be applied to very large problems.

Abstract

The  use  of  mixed  multilevel  orthogonal  arrays  in  robust 
design  has  gained  popularity  in  quality  improvement  areas  in 
recent years. We have investigated the use of genetic
algorithms  in  the  construction  of  such  arrays.  This  paper 
addresses  issues  encountered  in  formulating  the  problem 
(such  as  encoding  and  representation),  as  well  as  the  results 
of  this  application.  We  compare  this  technique  with  simulated 
annealing  which  we  published  previously. 

Introduction 

The  objective  of  engineering  design,  a  major  part  of 
research and development, is to produce high quality products
that  meet  customer  requirements.  Knowledge  of  scientific 
phenomena  and  past  engineering  experience  with  similar 
product  designs  and  manufacturing  processes  form  the  basis 
of  engineering  design  activity.  However,  a  number  of  new 
decisions  related  to  the  particular  product  must  be  made 
regarding  product  specification,  parameters  of  the  product 
design, the process design, and parameters of the manufacturing
process. A large amount of engineering effort is
consumed  in  conducting  experiments  (either  with  hardware 
or  by  computer  simulation)  to  generate  the  information  needed 
to  guide  these  decisions.  Robust  design  promoted  by  Dr. 
Genichi  Taguchi  is  an  engineering  methodology  for 
improving  productivity  during  research  and  development  so 
that  high-quality  products  can  be  produced  quickly  and  at  low 
cost  (Taguchi,  1986). 

Robust  design  draws  on  many  ideas  from  statistical 
experimental  design  to  plan  experiments  for  obtaining 
dependable information about variables involved in making
engineering  decisions.  Robust  design  makes  heavy  use  of 
orthogonal  arrays.  Robust  design  adds  a  new  dimension  to 
statistical  experimental  design.  It  helps  engineers  to  reduce 
economically  the  variation  of  a  product’s  function  in  the 
customer’s  environment.  Robust  design  also  ensures  that 
decisions  found  to  be  optimum  during  laboratory  experiments 
will  prove  to  be  so  in  manufacturing  and  in  customer  envi¬ 
ronments. 


A  matrix  experiment  consists  of  a  set  of  experiments 
where  we  change  settings  of  the  various  product  or  process 
parameters  we  want  to  study  from  one  experiment  to  another. 
Conducting  matrix  experiments  using  orthogonal  arrays 
allows the effects of several parameters to be determined
efficiently  and  is  an  important  technique  in  robust  design. 

In  this  paper,  we  describe  using  genetic  algorithms  to 
generate  mixed  multilevel  orthogonal  arrays,  without  the  need 
to resort to complex combinatorics theory. This is a continuing
effort in exploring novel optimization techniques for
generating general orthogonal arrays. We have reported on
the use of simulated annealing for such a purpose in a previous
paper  (Wang  and  Safadi,  1990). 

Genetic Algorithms: What they are and how
they work

Genes are essentially blueprints or maps that contain many
segments that are responsible for the way the parts of a living
species appear and function. These genes are modified during
the evolutionary process by means of reproduction and
mutation, where genes that are responsible for attributes in
the organism that help it survive are carried over with greater
probability into the next generation (survival of the fittest).
Now, if one thinks of survival of the fittest as an optimization
problem, with the genes mapping variables that are responsible
for the value of an objective function (the fitness of the
organism), then this evolution process should lead to values
of these variables that optimize the objective function. A
genetic algorithm is a procedure that mimics the evolution
process, and uses bit-strings (strings of 1's and 0's) as genes
to represent values of the independent variables, in which
these bit-strings undergo the changes that genes undergo
during evolution. Thus, bit-strings that represent a large (high
fitness) value of the objective function will survive, eventually
giving us a solution to the optimization problem. The following
is a list of steps that a simple genetic algorithm could
follow (Goldberg, 1989) (also see Figure 1):

1. Mapping of the variables into genes (encoding).

2.  Marking  of  the  genes  with  probabilities  for  their  partici¬ 
pation  in  the  reproduction  process  based  on  the  value  they 
cause  the  objective  function  to  have. 

3.  Reproduction  (Collection  of  a  gene  pool  for  mating). 

4.  Mating  of  the  genes  (crossover). 

5.  Mutation. 

In  order  to  illustrate  these  steps,  we  shall  formulate  the 
problem  of  mixed-multilevel  orthogonal  array  generation, 
encode  it  in  the  aforementioned  binary  form  (genes),  then 
apply  the  algorithm. 


Orthogonal  array  generation 

The  problem  statement  for  the  generation  of  an  orthogonal 
array  (OA)  is  as  follows: 

Given $F_1$ factors at $L_1$ levels, $F_2$ factors at $L_2$ levels,
..., $F_N$ factors at $L_N$ levels, generate the orthogonal (balanced)
array $L_1^{F_1} \times L_2^{F_2} \times \cdots \times L_N^{F_N}$, which is a matrix consisting of
columns that contain the various levels for each factor. The
orthogonality  requirement  is  met  if  and  only  if  the  following 
is  true: 

1.  Within  each  column,  there  must  be  an  equal  number  of 
occurrences  of  each  level  setting. 

2. Rows having a particular level setting in a certain
column must contain an equal number of each level
setting in every other column.

3.  The  number  of  rows  in  the  matrix  should  be  the  minimum 
that  achieves  the  above  conditions. 
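Conditions 1 and 2 can be checked mechanically. The sketch below is one reading of them, offered as an assumption rather than the authors' code: condition 2 is interpreted as pairwise balance, meaning that for every pair of columns each combination of levels occurs equally often. The function name `is_balanced` is hypothetical, and the set of levels in a column is taken from the values that actually appear in it.

```python
from collections import Counter
from itertools import combinations


def is_balanced(array):
    """Check conditions 1 and 2 for an array given as a list of rows."""
    rows = len(array)
    columns = list(zip(*array))
    levels = [sorted(set(col)) for col in columns]

    # Condition 1: each level occurs equally often within its column.
    for col, lv in zip(columns, levels):
        counts = Counter(col)
        if any(counts[v] * len(lv) != rows for v in lv):
            return False

    # Condition 2 (read as pairwise balance): every level combination
    # of two columns occurs rows / (La * Lb) times.
    for i, j in combinations(range(len(columns)), 2):
        pairs = Counter(zip(columns[i], columns[j]))
        expected = rows / (len(levels[i]) * len(levels[j]))
        for a in levels[i]:
            for b in levels[j]:
                if pairs[(a, b)] != expected:
                    return False
    return True


# A small 4-run, three-factor, two-level array that is balanced ...
balanced = [[1, 1, 1], [1, 2, 2], [2, 1, 2], [2, 2, 1]]
# ... and one that satisfies condition 1 but not condition 2.
unbalanced = [[1, 1, 1], [1, 2, 2], [2, 1, 1], [2, 2, 2]]
```

Here `is_balanced(balanced)` returns `True`, while `is_balanced(unbalanced)` returns `False` because its second and third columns always carry the same level.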

To  generate  the  OA,  we  first  generate  an  unbalanced  array, 
then  use  simulated  annealing  to  balance  it.  To  generate  the 
initial  unbalanced  array,  the  following  steps  are  taken: 

1. Satisfy condition 3 above. The minimum number of rows
$N_g$ is the least common multiple of the following list:
$L_i$ for $i = 1, \ldots, N$; $L_i L_j$ for $i, j = 1, \ldots, N$ ($i \neq j$); $L_i^2$ for $i = 1, \ldots, N$ if $F_i > 1$.

2. Find the number of occurrences of each level for each
particular column (condition 1 above). If the number of
levels in that column is $L$, then the number of occurrences
for each of these levels in that column is:

$N_g / L$

3.  Fill  the  columns  with  the  appropriate  number  of  levels  as 
calculated  above. 

This would give us the initial unbalanced matrix. To
illustrate, consider the array $3^1 \times 2^4$, which gives the
initial (unbalanced) matrix shown in Figure 2. The minimum
number of rows is lcm(2,3,4,6)=12. The balanced matrix is
given in Figure 3.
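The row count and the starting matrix can be reproduced with a short sketch. It assumes the list entering the least common multiple consists of each level count, each product of two different level counts, and the square of a level count whenever more than one factor shares it, which matches the lcm(2,3,4,6) example. Filling each column by cycling through its levels is also an assumption, chosen because it reproduces the pattern of the unbalanced matrix in Figure 2. The names `min_rows` and `initial_array` are hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def min_rows(design):
    """design is a list of (levels, number_of_factors) pairs."""
    candidates = [L for L, _ in design]
    candidates += [La * Lb
                   for i, (La, _) in enumerate(design)
                   for (Lb, _) in design[i + 1:]]
    candidates += [L * L for L, F in design if F > 1]
    return reduce(lcm, candidates)


def initial_array(design):
    """Unbalanced start: each column cycles through its own levels."""
    n = min_rows(design)
    column_levels = [L for L, F in design for _ in range(F)]
    return [[r % L + 1 for L in column_levels] for r in range(n)]


design = [(3, 1), (2, 4)]          # one 3-level factor, four 2-level factors
n_rows = min_rows(design)          # lcm(3, 2, 6, 4)
start = initial_array(design)
```

Every column of `start` already satisfies condition 1, so the search only has to reorder entries within columns to reach condition 2.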

To  speed  up  the  computation  time,  we  balance  the  array 
one  column  at  a  time.  First,  the  first  column  is  fixed,  the 
algorithm  is  performed  on  the  second  column  to  balance  it 
with the first using condition 2 above. Once this column is
balanced,  the  algorithm  is  performed  on  the  next  while  trying 
to  balance  it  with  the  previous  two  columns.  This  is  repeated 
until  the  whole  array  is  balanced. 


1 1 1 1 1
2 2 2 2 2
3 1 1 1 1
1 2 2 2 2
2 1 1 1 1
3 2 2 2 2
1 1 1 1 1
2 2 2 2 2
3 1 1 1 1
1 2 2 2 2
2 1 1 1 1
3 2 2 2 2


Figure 2. The unbalanced $3^1 \times 2^4$ matrix

1 1 1 2 1
1 1 2 1 1
1 2 1 2 2
1 2 2 1 2
2 1 1 1 2
2 1 2 2 2
2 2 1 2 1
2 2 2 1 1
3 1 1 1 1
3 1 2 2 2
3 2 1 1 2
3 2 2 2 1


Figure 3. The balanced $3^1 \times 2^4$ matrix


The  Algorithm 

We  will  now  illustrate  the  five  steps  of  the  algorithm  on 
one  of  the  columns  of  an  array. 


1. Encoding: An encoding scheme should ensure that all
possibilities for a column configuration can be represented by
it. We designed the encoding scheme illustrated in Figure 4.
The encoding gene in this case specifies a series of switching
operations to be performed on a fixed initial column in order
to arrive at the column that the gene encodes. In Figure 4, we
have a column of 6 rows, and an encoding gene with 15
"chromosomes", each of which represents a combination of
two rows in the column. To arrive at the column that the gene
is actually encoding, we switch the rows of the column whose
corresponding chromosomes in the encoding gene have a
value of 1. For example, chromosome #1 has a value of 0, so
rows 1 and 2 in the column will not be switched.
Chromosome #2, however, has a value of 1, so rows 1 and 3
will be switched. This process is repeated for all row
combinations.

2. Assigning genes probabilities for reproduction: Table
2 shows a list of genes and their corresponding fitness, and
based on that fitness, a segment of real numbers between 0
and 1. The idea is that the larger the fitness, the larger the
corresponding segment, and the larger the probability (Figure
3) that a random number between 0 and 1 will fall in that
segment, which is the way the genes are chosen for mating
and reproduction. This method ensures higher probabilities
of reproduction for genes with higher fitness.

3. Selection of the mating pool: A random number is
generated, and the gene that corresponds to the segment into
which this random number fits is selected to be a member of
the mating pool. This is repeated as many times as the number
of genes in the initial gene pool.

4. Mating and crossover: After building the mating pool,
genes from this pool are paired randomly, and crossover will
take place between them. In other words, segments of the
same (random) number of chromosomes are chosen from
random locations in the mating genes and exchanged (Figure
4).

5. Mutation: The function of mutation is to help prevent the
algorithm from being trapped in local minima. Mutation
happens infrequently and to a random chromosome in a
randomly selected gene.


Table 2. Gene Fitness and Segment assignments

Gene #    Fitness    Segment        Probability of Reproduction (%)
1         4          0.00 - 0.04    4.0
2         10         0.04 - 0.14    10.1
3         15         0.14 - 0.29    15.2
4         20         0.29 - 0.49    20.2
5         50         0.49 - 1.00    50.5


Figure 2. Gene Encoding.

Figure 3. Gene Mating probabilities.


Figure 4. Crossover
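The encoding and selection steps of this section can be illustrated with a short sketch. It is an interpretation, not the authors' program: the gene is assumed to hold one bit per pair of rows, ordered (1,2), (1,3), ..., as in the chromosome #1 and #2 example, and the switches are assumed to be applied one after another in that order. The fitness values are those listed in the gene fitness table; all function names are hypothetical.

```python
import random
from itertools import combinations


def decode_column(base_column, gene):
    """Apply the row switches named by the 1 bits of `gene` to a fixed column."""
    pairs = list(combinations(range(len(base_column)), 2))
    if len(gene) != len(pairs):
        raise ValueError("gene needs one bit per pair of rows")
    column = list(base_column)
    for bit, (i, j) in zip(gene, pairs):
        if bit == 1:
            column[i], column[j] = column[j], column[i]
    return column


def segments(fitness):
    """Upper ends of the [0, 1) segments, each proportional to fitness."""
    total = sum(fitness)
    bounds, upper = [], 0.0
    for f in fitness:
        upper += f / total
        bounds.append(upper)
    return bounds


def select(bounds, u):
    """Index of the gene whose segment contains the random number u."""
    for index, upper in enumerate(bounds):
        if u < upper:
            return index
    return len(bounds) - 1


def mating_pool(fitness, rng):
    """Draw as many genes as there are in the initial gene pool."""
    bounds = segments(fitness)
    return [select(bounds, rng.random()) for _ in fitness]


fitness = [4, 10, 15, 20, 50]                  # values from the fitness table
bounds = segments(fitness)
pool = mating_pool(fitness, random.Random(0))  # indices into the gene list

# Chromosome #2 set: rows 1 and 3 of a 6-row column are switched.
gene = [0, 1] + [0] * 13
column = decode_column([1, 1, 2, 2, 3, 3], gene)
```

Because a switch only exchanges two entries of the same column, every decoded column keeps the level counts of the initial column, so condition 1 of the orthogonality requirement is preserved by construction.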


Conclusion 

We  have  used  a  genetic  algorithm  to  construct  mixed 
multilevel  orthogonal  arrays.  These  arrays  are  quite  useful  in 
robust  design  and  quality  improvement  projects.  The  use  of 
this  novel  search  and  optimization  technique  allows  us  to 
generate these arrays without resorting to complex combinatorial
techniques which can also be restrictive.

Abstract

Three-dimensional graphs of mathematical/statistical
models are useful for the understanding of
many  phenomena.  However,  even  with  sophisticated 
expensive  computer  hardware  and  software,  realistic 
rendering  of  3-dimensional  graphs  is  still  a  difficult 
task.  For  scientists  with  a  limited  budget,  it  is  close 
to  impossible.  A  simple  inexpensive  approach  is  to 
construct  3-D  surfaces  with  LEGO  bricks.  Steps 
include:  (a)  simulate  data  (calculate  outputs,  i.e.,  z 
values  from  a  design  matrix  of  x,  y  inputs)  for  the 
surface  via  a  favorite  computing  language  or  package; 
(b) round off data to the resolution of individual
bricks,  and  tabulate  data  on  sheets  of  paper,  one 
sheet  for  each  brick  layer;  (c)  construct  a  wooden 
platform with axis tic marks and labels; and (d)
construct  the  3-D  surface  with  bricks.  Examples  of 
useful  models  which  have  been  constructed  include 
ones of: (a) synergism for two anticancer drugs; (b)
antagonism  for  two  anticancer  drugs;  (c)  a  composite 
generalized  nonlinear  model  consisting  of  a  logistic 
dose-response  structural  model  with  a  binomial  data 
variation  model;  and  (d)  a  likelihood  function 
associated  with  the  fitting  of  a  monoexponential 
pharmacokinetic  model  to  data  with  two  estimable 
parameters, illustrating profile likelihood. These 3-D
LEGO  models  are  useful  for  (a)  studying  the  shape  of 
3-D  functions;  (b)  gaining  insight  into  physical 
phenomena;  (c)  explaining  concepts  in  statistical 
analysis  approaches;  and  (d)  designing  experiments. 


Introduction 

The  visualization  of  3-dimensional  (3-D) 
mathematical/statistical models is important in many
branches  of  applied  and  theoretical  mathematics  and 
statistics.  Such  visualization  is  very  difficult  however, 
even  with  expensive  computer  graphics  capabilities.  It 
is  especially  difficult  to  represent  3-dimensional 
surfaces  as  2-dimensional  (2-D)  static  images.  Tricks 
such  as  shading,  shadowing,  motion,  stereoscopic 
hardware  and  holography  may  provide  some 
assistance  in  3-D  visualization,  but  none  of  these 
approaches  are  ideal,  and  most  are  expensive. 

Driven  by  the  need  to  intimately  understand 
3-D  concentration-effect  surfaces  for  my  research  in 
Pharmacometrics,  and  constrained  by  the  limits  of  a 
small budget, I constructed four 3-D graphs with
LEGO and LEGO-compatible building blocks. A
picture of me with three of these models is shown in
Figure 1. They are of a suitable size for classroom
teaching,  one-on-one  tutoring,  and  contemplative 
thinking.  When  carefully  packed  in  boxes  with 
styrofoam  beads,  they  are  easily  transported  by  car 
and/or  plane.  The  approach  which  I  used  to  construct 
these  models  is  quite  general,  and  should  be 
applicable  and  useful  to  a  wide  variety  of 
mathematical/statistical  topics  for  both  research  and 
teaching  purposes.  This  article  describes  the 
construction  and  use  of  these  models. 


¹Supported by NCI grants CA46732, CA16056, and
CA21071.


3-D Response Surfaces with LEGO Bricks


Figure  1.  A  proud  man  and  his  models. 


Methods 

Each 3-D mathematical/statistical function
was first simulated with custom FORTRAN
programs. The 3-D array of points was then printed
out on a 2-D table with the values of the X and Y
variables listed along the top and left side of the table,
and  with  the  values  of  the  Z  (height)  variable  listed  in 
the  cells  of  the  table.  The  Z  values  were  rounded  to 
the  nearest  LEGO  brick. 
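The rounding and tabulation step can be sketched in a few lines. This is an illustration under stated assumptions, not a translation of the FORTRAN programs: the mapping from the z range to a maximum number of bricks is a choice made here (the text only says values were rounded to the nearest brick), and the function names are hypothetical. The second function produces one 0/1 sheet per brick layer, in the spirit of the one-sheet-per-layer tabulation mentioned in the abstract.

```python
import math

BRICK_HEIGHT_IN = 0.375   # height of one standard brick, in inches


def brick_heights(surface, z_min, z_max, max_bricks):
    """Round every z value to a whole number of bricks."""
    scale = max_bricks / (z_max - z_min)
    return [[int(math.floor((z - z_min) * scale + 0.5)) for z in row]
            for row in surface]


def layer_sheets(heights):
    """One sheet per layer: 1 where a brick is needed in that layer."""
    top = max(max(row) for row in heights)
    return [[[1 if h >= layer else 0 for h in row] for row in heights]
            for layer in range(1, top + 1)]


# Tiny 2 x 2 surface on a 0-100 response scale, built 22 bricks tall.
surface = [[0.0, 30.0], [60.0, 100.0]]
heights = brick_heights(surface, 0.0, 100.0, 22)
sheets = layer_sheets(heights)
tallest_in_inches = max(max(row) for row in heights) * BRICK_HEIGHT_IN
```

For this toy surface `heights` is `[[0, 7], [13, 22]]`, and the tallest stack is 8.25 inches high.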

Using  the  2-D  table  as  a  guide,  each  of  the 
models  was  constructed  on  a  standard  10  in  by  10  in 
LEGO base. The heights of the models varied from
11  to  22  standard  LEGO  bricks.  (Each  brick  is  0.375  in 
high.)  Models  were  constructed,  one  layer  at  a  time  of 
a  uniform  color,  with  each  adjacent  layer  (or  set  of 
layers)  being  a  different  color.  Black  bricks  were  often 
used to highlight important contours. The judicious use


of  color  is  a  great  aid  to  the  thorough  understanding 
of  the  3-D  model  by  the  viewer. 

For each model, a wood base, 13 in by 13 in
by 4.5 in was constructed, and covered with black
laminate. (A simpler base made from one piece of 0.5
inch pressed board or plywood would be adequate.)
Axis  labels  with  tic  marks  were  made  to  the  proper 
scale,  were  laminated,  and  then  glued  to  the  wood 
bases.  Finally,  the  LEGO  models  were  glued  onto  the 
bases. 


Description  of  Models 

A. Synergism. Figure 2 shows a concentration-effect
surface for two drugs, DDATHF (5,10-dideazatetrahydrofolate)
and trimetrexate, and a
response  which  is  the  growth  of  cells  in  a  cell 
culture  assay,  expressed  as  a  percent  of  control 
growth.  The  details  for  this  in  vitro  cancer 
chemotherapy  experiment  are  described  elsewhere 
(Greco  et  al,  1990).  Equation  1  was  fit  to  the  data  in 
Figure  1  with  iteratively  reweighted  nonlinear  least 
squares. The best fit surface, shown in Figure 2, was
constructed  with  SAS/GRAPH  (SAS  Institute, 
1990).  The  mathematical/statistical  details  of  the 
nature,  origin  and  use  of  Equation  1  have  been 
published  elsewhere  (Greco  et  al,  1990;  Greco  and 
Lawrence, 1988; Greco, 1989; Syracuse and Greco,
1986). 

Briefly, Equation 1 allows the slopes of the
concentration-effect curves for the two drugs to be
unequal. A convention used in Equation 1 is that as
drug concentration(s) increases, the measured
response decreases; the slope parameter, m, is
negative. The output, E, is the measurement from
the cell growth assay; and the inputs are [TMTX],
[DDATHF], the respective concentrations of TMTX
and DDATHF. The seven estimable parameters
include: $E_{con}$, the control or maximum response at 0
drug concentration; B, the extrapolated background
response at infinite drug concentration; $IC_{50,TMTX}$,
$IC_{50,DDATHF}$, the median effective concentrations of
TMTX, DDATHF respectively; $m_{TMTX}$,
$m_{DDATHF}$, slope parameters for TMTX, DDATHF
respectively; and $\alpha$, the synergism-antagonism
parameter. When $\alpha$ is positive synergism is
indicated, when $\alpha$ is negative antagonism is
indicated, and when $\alpha$ is 0, no interaction or
additivity is indicated. The magnitude of $\alpha$ is
algebraically related to the degree of bowing of
isobols (contours cut through the surface at specific
response levels); a larger $\alpha$ will result in a larger
degree of bowing.

The specific surface shown in Figure 2, the
LEGO model in the lower left of Figure 1, the LEGO
model in Figure 3, and on the computer screen in
Figure 3, is the best fit surface for data from an
experiment in which cells were exposed to 2 µM folic
acid in addition to the drugs, trimetrexate and
DDATHF. The parameter estimates were: $E_{con}$ =
0.787 response units; B = 0.0213 response units;
$IC_{50,DDATHF}$ = 3.91 nM; $m_{DDATHF}$ = -3.91;
$IC_{50,TMTX}$ = [illegible]; $m_{TMTX}$ = -216; and $\alpha$ = 4.68.

$$
1 = \frac{[TMTX]}{IC_{50,TMTX}\left(\frac{E-B}{E_{con}-E}\right)^{1/m_{TMTX}}}
  + \frac{[DDATHF]}{IC_{50,DDATHF}\left(\frac{E-B}{E_{con}-E}\right)^{1/m_{DDATHF}}}
  + \frac{\alpha\,[TMTX]\,[DDATHF]}{IC_{50,TMTX}\,IC_{50,DDATHF}\left(\frac{E-B}{E_{con}-E}\right)^{1/(2m_{TMTX})}\left(\frac{E-B}{E_{con}-E}\right)^{1/(2m_{DDATHF})}}
\tag{1}
$$


Figure 2. A 3-D surface of Equation 1 with parameter values listed in the text, constructed with SAS/GRAPH.
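Equation 1 gives the response E only implicitly, so a surface has to be computed by solving it numerically at each pair of concentrations. The sketch below does this by bisection. It is an assumption-laden illustration, not the author's FORTRAN: the parameter values are hypothetical round numbers rather than the fitted estimates, the drugs are labelled 1 and 2, and bisection is only guaranteed to find the root when both slopes are negative and the interaction parameter is zero or positive, in which case the left-hand side grows steadily with E.

```python
def interaction_residual(E, d1, d2, p):
    """Right-hand side of the interaction equation minus 1 at response E."""
    ratio = (E - p["B"]) / (p["Econ"] - E)
    t1 = d1 / (p["ic50_1"] * ratio ** (1.0 / p["m1"]))
    t2 = d2 / (p["ic50_2"] * ratio ** (1.0 / p["m2"]))
    t12 = (p["alpha"] * d1 * d2 /
           (p["ic50_1"] * p["ic50_2"] *
            ratio ** (0.5 / p["m1"] + 0.5 / p["m2"])))
    return t1 + t2 + t12 - 1.0


def predicted_response(d1, d2, p):
    """Solve for E between B and Econ by bisection (alpha >= 0 assumed)."""
    if d1 == 0 and d2 == 0:
        return p["Econ"]                 # no drug: control response
    span = p["Econ"] - p["B"]
    lo = p["B"] + 1e-12 * span
    hi = p["Econ"] - 1e-12 * span
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if interaction_residual(mid, d1, d2, p) > 0:
            hi = mid                     # residual too large: root lies below
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical parameters, chosen only to exercise the solver.
additive = {"Econ": 100.0, "B": 0.0, "ic50_1": 1.0, "ic50_2": 1.0,
            "m1": -2.0, "m2": -1.0, "alpha": 0.0}
synergy = dict(additive, alpha=2.0)

e_single = predicted_response(1.0, 0.0, additive)   # drug 1 alone at its IC50
e_add = predicted_response(0.5, 0.5, additive)
e_syn = predicted_response(0.5, 0.5, synergy)
```

With drug 1 alone at its median effective concentration the solver returns the midpoint between background and control, and a positive interaction parameter lowers the predicted response at the same pair of concentrations.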


Figure 3. Comparison of 3-D LEGO model with the same surface generated on a computer screen.

Figure 4. General scheme for the dissection of a generalized nonlinear model into random and structural
components for a concentration-effect curve for a single drug.

B. Antagonism. The LEGO model on the lower
right of Figure 1 is for drug antagonism. The surface
was simulated using the generic form of Equation 1,
for which drug 1 and drug 2 are designated as $D_1$ and
$D_2$, and with $E_{con}$ = 100; $IC_{50,1}$ = 1 µM; $IC_{50,2}$ = [illegible]
µM; $m_1$ = -2; $m_2$ = -1; $\alpha$ = -1; and B = 0. Note the
saddle shape of the surface.

C. Generalized Nonlinear Modeling. Figure 4
is a pictorial dissection of a generalized nonlinear
model (McCullagh and Nelder, 1983) into structural
and random components. The heavy black sigmoid
curve is the structural component, and was simulated
from the lower equation in Figure 4. In this equation,
$\mu$ is the expected (mean) response; D is the
concentration of drug; Dm is the median effective
dose (same as $IC_{50}$); and m is the same slope
parameter as in Equation 1. For the simulated curve,
Dm = 10 and m = -1. The three distributions shown in
Figure 4, the normal, a modified binomial and a
modified Poisson distribution, represent possible
random components of a complete generalized
nonlinear model. For the modified binomial
distribution, the equation is shown in the upper
portion of Figure 4, where Y = k/n, and k is the
number of successes, n is the number of tries; mean =
$\mu$; variance = $\mu(1-\mu)$. The graph of the binomial
distribution in Figure 4 was simulated at D = 2.5 µM,
with $\mu$ = 0.80 and n = 5. For a complete description of
the application of these generalized nonlinear models
to concentration-effect data, see Greco and Lawrence,
1988 and Greco, 1989.
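The two components can be put side by side in a small sketch. The structural equation itself appears only in Figure 4, so the logistic (Hill) form used below is an assumption; it was chosen because, with Dm = 10 and m = -1, it gives a mean response of 0.80 at D = 2.5, the value quoted in the text. The random component is shown as an ordinary binomial distribution for k successes in n tries, not the modified form of the figure. Function names are hypothetical.

```python
from math import comb


def mean_response(D, Dm, m):
    """Assumed logistic structural model: (D/Dm)^m / (1 + (D/Dm)^m)."""
    r = (D / Dm) ** m
    return r / (1.0 + r)


def binomial_pmf(k, n, mu):
    """Probability of k successes in n tries with success probability mu."""
    return comb(n, k) * mu ** k * (1.0 - mu) ** (n - k)


mu = mean_response(2.5, 10.0, -1.0)              # mean at D = 2.5
# Distribution of the observed proportion Y = k/n for n = 5.
distribution = {k / 5: binomial_pmf(k, 5, mu) for k in range(6)}
```

Evaluating `distribution` at a series of doses along the curve gives the kind of picture shown in the LEGO model, where the binomial distribution is repeated all along the sigmoid.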

The  LEGO  model  being  held  in  Figure  1  is  a  3- 
D  representation  of  Figure  4,  with  the  same  sigmoid- 
logistic  structural  model,  the  same  modified  binomial 
model,  and  with  the  same  parameters.  However,  in 
the  LEGO  model,  the  binomial  distribution  is  shown 
all  along  the  length  of  the  sigmoid  curve. 

D.  Likelihood  Surfaces.  Figure  5  shows  a  3-D 
negative  log  likelihood  surface  for  the  following 
problem: 

Equation 2 is a standard monoexponential 1-compartment
pharmacokinetic structural model with
bolus intravenous injection of drug.

$$C_p = \frac{D}{V}\,\exp\left(-\frac{Cl}{V}\,t\right) \tag{2}$$

Where:  Cp  is  the  concentration  of  a  drug  in  plasma;  D  is 
the  administered  dose  of  drug;  t  is  the  time  that  a  plasma 
sample  is  drawn  for  a  plasma  drug  level  measurement;  V  is 
the  volume  of  distribution  (a  parameter);  and  Cl  is  the 
plasma  clearance  of  the  drug  (a  parameter). 

A set of hypothetical data is as follows (D = 1 µmole):

time (min)    Cp (µM)
1             0.904
5             0.607
10            0.500
20            0.135

The data were fit with the nonlinear regression software
package, PCNONLIN (Statistical Consultants Inc., 1989),
in which the weight, $w_i$, for each point was equal to the
reciprocal of the square of the predicted plasma
concentration, with the sum of the weights forced to equal
N, the number of data points:

$$\sum_{i=1}^{N} w_i = N$$

The parameter estimates at the optimum (minimum of the
objective function) were $\hat{V}$ = 0.896 ± 0.15 (S.E.) L, and $\hat{Cl}$
= 0.0909 ± 0.0093 (S.E.) L/min.
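The model and the weighting scheme can be evaluated directly from the numbers above. This is a minimal sketch with hypothetical function names; it only evaluates Equation 2 at the reported estimates and rescales the reciprocal-squared-prediction weights so that they sum to N. It does not attempt to reproduce the PCNONLIN fit.

```python
import math

TIMES = [1.0, 5.0, 10.0, 20.0]            # sampling times (min)
OBSERVED = [0.904, 0.607, 0.500, 0.135]   # hypothetical plasma concentrations


def predict(t, V, Cl, dose=1.0):
    """One-compartment bolus model: Cp = (D / V) * exp(-(Cl / V) * t)."""
    return dose / V * math.exp(-Cl / V * t)


def normalized_weights(predictions):
    """Weights proportional to 1 / prediction^2, rescaled to sum to N."""
    raw = [1.0 / (c * c) for c in predictions]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


predictions = [predict(t, 0.896, 0.0909) for t in TIMES]
weights = normalized_weights(predictions)
residuals = [obs - c for obs, c in zip(OBSERVED, predictions)]
```

Because the weights grow as the predicted concentration falls, the last sample carries by far the largest weight in this data set.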


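The objective function, O, shown in the LEGO model
of Figure 5 is defined in Equation 3.

$$O = \frac{\sum_{i=1}^{N} w_i\,(C_{p,i} - \hat{C}_{p,i})^2}{s^2} \tag{3}$$

where:

$$s^2 = \frac{\sum_{i=1}^{N} w_i\,(C_{p,i} - \hat{C}_{p,i})^2}{N - P} \tag{4}$$

at the optimum.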

Thus, the 3-D surface in Figure 5 has $\hat{V}$ as the X-axis,
$\hat{Cl}$ as the Y-axis, and O as the Z-axis. Note the irregular
shape of the negative log likelihood bowl. Asymptotic
95% confidence intervals are usually calculated based
upon the assumption that the negative log likelihood
surface is a regularly-shaped ellipsoid (football shape)
near the minimum. The optimal values for the
parameter estimates are at the lowest point in the
negative log likelihood bowl. At this point,

$O_{opt}$ = N - P = 2 (from combining Equations 3 and 4).

The critical value is F(0.95,1,2) = 18.513; and thus at
O = 20.513 (2 + 18.513), the black top layer of the LEGO
model, one can calculate 95% confidence intervals for the
parameters via profile likelihood. The usual asymptotic
95% confidence intervals for V and Cl were 0.263 to 1.53
and 0.0508 to 0.131 respectively. The profile likelihood
95% confidence intervals for V and Cl were 0.015 to 1.68
and 0.005 to 0.198 respectively.
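The relationship between Equations 3 and 4 and the profile-likelihood cutoff can be sketched in Python as follows. This is only an illustration: it assumes the weights are recomputed from the predictions at each (V, Cl) and that $s^2$ is fixed at its value at the reported estimates; all function names are hypothetical, and the original PCNONLIN computation may differ in detail.

```python
import math

TIMES = [1.0, 5.0, 10.0, 20.0]          # min, from the data table
OBSERVED = [0.904, 0.607, 0.500, 0.135]  # observed Cp, from the data table
DOSE = 1.0
N_PARAMS = 2                             # V and Cl


def predict(volume, clearance):
    """Equation 2 evaluated at every sampling time."""
    return [(DOSE / volume) * math.exp(-(clearance / volume) * t)
            for t in TIMES]


def weighted_ss(volume, clearance):
    """Weighted residual sum of squares with weights summing to N."""
    preds = predict(volume, clearance)
    raw = [1.0 / (p * p) for p in preds]
    scale = len(raw) / sum(raw)
    return sum(scale * r * (obs - p) ** 2
               for r, obs, p in zip(raw, OBSERVED, preds))


def objective(volume, clearance, v_hat=0.896, cl_hat=0.0909):
    """Equation 3, with s^2 from Equation 4 evaluated at the estimates."""
    s2 = weighted_ss(v_hat, cl_hat) / (len(TIMES) - N_PARAMS)
    return weighted_ss(volume, clearance) / s2


o_opt = objective(0.896, 0.0909)   # equals N - P by construction
cutoff = o_opt + 18.513            # add the F(0.95, 1, 2) critical value
```

Evaluating `objective` over a grid of V and Cl values and keeping the points at or below `cutoff` would give the region from which profile-likelihood intervals are read.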


Figure  5.  LEGO  model  of  a  negative  log  likelihood 
surface. 

Summary 

Lego  bricks  provide  an  inexpensive,  easy-to-use 
medium  for  constructing  useful  3-D  models  of 
mathematical  and  statistical  functions.  Many  high  level 
concepts  in  the  field  of  Statistics  can  be  communicated  to 
non-mathematically  trained  scientists  (as  well  as  to  more- 
mathematically  trained  scientists)  with  the  use  of  these 
tangible models. The statistical ideas presented in this
short article could have been much more clearly and
succinctly presented if the reader could have seen and
touched the LEGO models. In a paradoxical sense, this
article, filled with symbols, numbers and equations, is the
antithesis of the point which I would like to emphasize: the
great potential for excellent graphical models to improve
both teaching and research involving statistical
applications and theory.

The year-to-year growth in airline passenger traffic,
accompanied by the recent introduction of wide-bodied
airliners to handle the demand, has caused increasingly
severe parking problems for airliners at terminal gates.
An algorithm has been developed to optimize airplane
parking configurations. The algorithm is based upon
dynamic programming, and determines a parking
configuration which maximizes the utilization of airliners
in a given fleet mix. It solves for a string of tokens which
represent airplane parking maneuver envelopes. The
envelopes are characterized by a discrete collection of
possible combinations of airplane type, airline ground
footprint, parking angle, and maneuver in and out of
terminal loading configuration. The maneuvers are
predetermined to allow independent access to each
terminal loading zone. A solution is constrained to obey
width and parking obstacle constraints in multiple,
overlapping loading zones, including corners. The
complexity is a low-order polynomial in the linear extent
of a contiguous string of loading zones.

1  Introduction 

The year-to-year growth in airline passenger traffic,
accompanied by the recent introduction of wide-bodied
airliners to handle the demand, has caused increasingly
severe parking problems for airliners in terminal loading
zones. The competition for limited space within loading
zones makes it imperative to find solutions that conserve
limited parking space while obeying FAA airplane
clearance requirements. Space-saving solutions that allow
more and larger airliners to occupy a given loading zone
simultaneously can be of such importance that designers
will modify the terminal building structure to accommodate
them.


The algorithm presented here addresses this problem by
identifying space-saving solutions for airplane parking
within loading zones (Fig. 1). Constraints on geometry can
be accounted for by applying simple rules to the
two-dimensional geometry of airplanes, their clearance
requirements, loading zones, and obstructions to airplane
taxiing and independent access to a parking space. Once
the constraints are analyzed and converted into simple
geometric quantities, a dynamic programming algorithm
[Gar72] can be applied to find solutions. The solutions are
expressed in terms of the number of airplanes of each of
one or more specified types (Boeing 767-200, 757-100,
McDonnell Douglas DC-10, etc.) that can be parked in a
given loading zone simultaneously. For example, an airline
with a fleet mix of DC-10s and 757s might wish to park as
many DC-10s as possible, and then to park as many 757s as
possible within any remaining space. Smaller airliners,
such as 737s or DC-9s (depending on the fleet mix of the
airline involved) might then occupy any remaining parcels
of loading zone space if there is sufficient room.

In addressing this problem, several factors must be taken
into account. Normally, loading zones are constricted on
at least one side by the terminal wall, and on another side
by a taxi lane (Fig. 2), from which airliners enter and
leave the loading zone. Entrance and egress occur along a
prescribed, well-marked path in both the taxi lane and the
loading zone. FAA clearance requirements, such as wingtip
clearances, must be obeyed at all times. Thus, as in the
case shown for La Guardia Airport in Fig. 1, an airplane
can taxi along a progressively narrower taxi lane only
until the wingtip clearance points of its clearance
envelope touch the limit lines on either side of the taxi
lane. This limits the range of parking solutions available
to an airplane of a given type in a given loading zone,
quite apart from any restrictions imposed by the geometry
of the loading zone itself. Normally, airliners pivot
sharply and enter the loading zone from the taxi lane at
close to a 90 degree angle to the limit line, thus
approaching the terminal wall head-on (Fig. 3). For various
reasons, however, they may come to rest in a loading
configuration which forms an acute angle with the terminal
wall. An example of this occurrence is shown in Fig. 2.

Added to the considerations already presented is the
variation in airline requirements. These include airliner
ground footprint requirements, which are specific to each
airline for each airplane type in its fleet. These
requirements consist of the space occupied by baggage
handling and other equipment surrounding an airliner in
loading configuration.

2  Analysis 

The airplane parking optimization algorithm applies a form
of dynamic programming to maximize the value of airplanes
parked while obeying limits on total loading zone length.
An example was stated in Section 1, in which the number of
DC-10s parked simultaneously is to be maximized, followed
by 757s. In dynamic programming, a solution is divided into
a sequence of stages, and an optimality principle is
applied at each stage according to a monotonicity
assumption: If the cumulative value of a partial solution
at any stage is a monotonically increasing function of
value increments at each stage, and the cumulative cost of
resources to obtain that value obeys a similar relationship
to the partial cost at each stage, then the optimality
principle allows one to avoid explicitly examining every
possible combination of alternatives for all stages. The
alternatives are narrowed down at each stage so that in
succeeding search stages, a relatively small set of
cumulative, partial solutions is maintained. These consist
of only those whose values (costs) would contribute toward
the total value (cost) in an optimal solution.

Before the parking algorithm can find solutions, however,
the airplane clearances and loading zone geometry must be
analyzed to derive quantities which can be manipulated by
the dynamic programming algorithm. Hence, the solution
method consists of two parts. The first part is a geometric
analysis phase, in which a library of simple interaction
rectangles and loading zone limits is created from the
complexities of airport geometry, clearance requirements,
independent access, airplane type, and airline ground
equipment. In creating the library, it is often prudent to
assume several different loading configuration angles
relative to the terminal wall, to allow flexibility in
finding solutions that allow an optimum mix of airplanes
to be parked while accounting for obstructions and space
limitations within a given loading zone. An intermediate
result of the geometric analysis phase is a library of
airplane maneuver envelopes, each envelope being defined
by a particular combination of the geometry, clearances,
and loading configurations analyzed (see Figs. 2 and 3).

In the second part of the solution method, a dynamic
programming algorithm derives solutions by finding linear
arrangements of the interaction rectangles within the
extent of a loading zone. The loading zone can consist of
multiple bands, with corner slots at the corners of a
terminal wall (Fig. 4). Each band and corner slot consists
of overlapping bands which define the allowed regions for
parking airplanes of different types in different loading
configurations. These allowed regions depend upon taxi
lane obstructions and parking strategy used, as previously
discussed. Alternatively, the loading zone can be curved
(Fig. 5).

User-specified fleet mix assumptions can be applied in
solving the dynamic program: for example, the number of
DC-10s available at a given time may be limited by
scheduling constraints at the airport for which a solution
is being sought.

The individual rules for the geometric analysis of airplane
parking maneuvers in loading zones are simple, but applying
them is complicated by their interaction and the need to
provide flexibility in solving the optimization scenarios.
As mentioned before, alternative loading configurations
must be considered within a given loading zone for each
airplane that could conceivably park in the zone. A loading
zone may have multiple limits because of, among other
things, taxiing constraints as illustrated in Fig. 2:
larger airplanes, or loading configurations of even
medium-sized airplanes that require perpendicular entry
into the loading zone, do not apply beyond the point at
which the taxi lane is too narrow to allow them.

The geometric analysis phase is further complicated by the
following consideration: The cost associated with a trial
parking solution is the amount of space that it occupies in
the loading zone. Suppose that a trial solution at stage N
of the dynamic programming algorithm were modeled as a
string of N maneuver envelopes, beginning at a specified
end of the loading zone under consideration (see Fig. 6, in
which the string is actually the final optimum solution
derived for a particular loading zone at La Guardia
Airport). The envelopes are packed as closely as possible
within the string, which obeys the limit constraints of all
overlapping bands within the loading zone. The complication
is that if maneuver envelopes define the dynamic
programming stages, the space-utilization cost of a string
cannot be measured by summing the partial costs of the N
solution stages (the maneuver envelopes forming the
string). This is not a well-defined quantity, since the
space an envelope occupies really depends upon its
interaction with its neighbors in the string. This problem
is yet further complicated by the subdivision of the
loading zone into overlapping bands, each with its own
limits on airplane maneuver envelopes (some are excluded
altogether from a given band, as mentioned).

The quantities that are actually measured are the lengths
of interaction rectangles (Figs. 7 and 8). These rectangles
are defined by dividing each maneuver envelope along a
selected "midline", and then deriving the length of each
interacting pair of envelopes, measured from midline to
midline, as shown in Fig. 7. A solution string then
consists of interaction rectangles, joined so that the
maneuver envelopes from which they were derived are
reconstituted. Thus, the interaction rectangles have
labelled ends corresponding to the envelopes from which
they were derived. In Figs. 7 and 8, the end-labels are
shown as simply A, B, and C. In actuality, they are indexed
to indicate the combinations of airline, airplane, loading
configuration, and parking strategy by which their
constituent maneuver envelopes were derived. In any trial
solution string, the constraint that end-labels must match
is applied, thereby reconstituting the maneuver envelopes
(special rectangles with labels matching those at the ends
complete each end of the string).
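The end-label matching rule can be illustrated with a short Python sketch. The `Rect` class, the labels, and the lengths below are hypothetical stand-ins chosen for illustration; the paper's actual end-labels carry airline, airplane, loading configuration, and parking strategy indices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """An interaction rectangle spanning two envelope midlines."""
    left: str      # end-label of the envelope on the left
    right: str     # end-label of the envelope on the right
    length: float  # midline-to-midline length (made-up units)


def is_valid_string(rects):
    """Adjacent rectangles must share the label where they join."""
    return all(a.right == b.left for a, b in zip(rects, rects[1:]))


def string_length(rects):
    """Space used by a string is the sum of its rectangle lengths."""
    return sum(r.length for r in rects)


ab = Rect("A", "B", 42.0)
bc = Rect("B", "C", 37.5)
```

Joining `ab` and `bc` reconstitutes envelope B from its two halves, whereas `bc` followed by `ab` is rejected because label C does not match label A.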

3  Solution  Method 

The dynamic programming algorithm solves the following
general problem. Let the interaction rectangles derived in
the geometric analysis phase be $R_{11}, R_{12}, \ldots, R_{MN}$,
where i and j in $R_{ij}$ are indices denoting the maneuver
envelopes. Recall that each maneuver envelope is defined in
terms of airline ground clearance requirements,
FAA-specified airplane clearances, airplane type, and the
geometry of different loading zones, taxi lanes,
obstructions, and the possible loading configurations and
parking maneuvers under consideration. Also, some of the
indices represent the geometry of one end of a loading
zone, serving as a starting point for solution strings.
Thus, M and N differ only in that M provides for N maneuver
envelopes plus the required end-geometry configurations,
assuming solution strings are formed from left to right.
Let $f(R_{ij})$ denote the length of the interaction rectangle
$R_{ij}$. Let $A_1, A_2, \ldots, A_T$ denote the sets of indices of
maneuver envelopes which correspond to each of the T
airplane types, and let $I_t$ $(t = 1, 2, \ldots, T)$ be the
corresponding number of copies of airplanes of type t
(Boeing 767-200, for example) available. Finally, let
$L_1, L_2, \ldots, L_K$ be sets of ordered pairs (i, j) of indices
corresponding to the interaction rectangles that are
allowed in each of the K overlapping bands and corner slots
in the loading zone. These sets are determined from the
original maneuver envelope limits during the geometric
analysis phase. Let $l_k$ and $U_k$ be the lower and upper
bounds on interaction rectangles with index pairs in $L_k$.

Let $X^S = R_{i_0 i_1} R_{i_1 i_2} \cdots R_{i_{S-1} i_S}$, where each
$R_{i_{k-1} i_k} = R_{pq}$ for some indices $1 \le p \le M$ and
$1 \le q \le N$ $(1 \le k \le S)$, denote a trial solution string
at stage S of the algorithm. Let $|X_t^S|$ denote the number
of airplanes of type t in the string $X^S$. The cost of the
string $X^S$ is its total length,

$$c(X^S) = \sum_{k=1}^{S} f(R_{i_{k-1} i_k}). \qquad (1)$$

Then at stage S, each partial trial solution string $X^S$
must obey the following constraints:

$$|X_t^S| \le I_t \quad (t = 1, 2, \ldots, T) \qquad (2)$$

and

$$(i_{S-1}, i_S) \in L_k$$

whenever

$$l_k \le c(X^S) \le U_k \quad (k = 1, 2, \ldots, K). \qquad (3)$$

A final solution string must maximize the desired
quantities (number of DC-10s and 757s, for example)
subject to (2)-(3).

The dynamic programming algorithm avoids enumerating all
possible partial strings. At stage S, let two trial
solution strings be $X^S = R_{i_0 i_1} R_{i_1 i_2} \cdots R_{i_{S-1} i_S}$
and $Y^S = R_{i'_0 i'_1} R_{i'_1 i'_2} \cdots R_{i'_{S-1} i'_S}$. If the two
strings $X^S$ and $Y^S$ have the same value (contain the same
number of airplanes of each type to be maximized, for
example) and if the end maneuver envelope indices are the
same in both strings, $p = i_S = i'_S$, then the only advantage
one string can possess over the other in determining an
optimum is lower cost. This follows because they possess
the same value, and because at stage S + 1, both strings
are extended by adding interaction rectangles of the form
$R_{pq}$ for all q such that (2)-(3) are satisfied with S
replaced by S + 1. That is, both strings offer the same
possibilities for adding interaction rectangles at the
next stage. The algorithm exploits this fact by
eliminating either $X^S$ or $Y^S$, depending on which possesses
the higher cost. Elimination of pairs is performed
repeatedly until the stage S solution strings contain no
such pairs. Thus, only the lowest-cost strings with unique
properties to contribute to a final solution are kept.
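The stage-by-stage elimination described above can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: it assumes a single band with one overall length limit in place of the band constraints in (3), omits the special closing rectangle at the far end, and uses made-up labels, lengths, and fleet sizes. The function name `best_parking` and its arguments are hypothetical.

```python
def best_parking(lengths, plane_type, fleet, zone_length, priority,
                 start="START"):
    """Keep, per (end label, type counts), only the cheapest string."""
    types = list(priority)
    zero = (0,) * len(types)
    frontier = {(start, zero): (0.0, [start])}
    best_counts, best = zero, (0.0, [start])
    while frontier:
        nxt = {}
        for (end, counts), (cost, labels) in frontier.items():
            for (i, j), seg in lengths.items():
                if i != end:          # end-labels must match
                    continue
                t = types.index(plane_type[j])
                new_counts = counts[:t] + (counts[t] + 1,) + counts[t + 1:]
                new_cost = cost + seg
                if new_counts[t] > fleet[types[t]] or new_cost > zone_length:
                    continue          # fleet limit or length limit violated
                key = (j, new_counts)
                # Same end label and same value: keep the lower cost only.
                if key not in nxt or new_cost < nxt[key][0]:
                    nxt[key] = (new_cost, labels + [j])
        for (_, counts), entry in nxt.items():
            # Tuples compare lexicographically, i.e. in priority order.
            if counts > best_counts or (counts == best_counts
                                        and entry[0] < best[0]):
                best_counts, best = counts, entry
        frontier = nxt
    return best_counts, best[0], best[1]


# Hypothetical data: "D" is a DC-10 envelope, "S" a 757 envelope.
lengths = {("START", "D"): 30, ("START", "S"): 20, ("D", "D"): 50,
           ("D", "S"): 40, ("S", "D"): 40, ("S", "S"): 28}
counts, cost, labels = best_parking(
    lengths, {"D": "DC-10", "S": "757"}, {"DC-10": 2, "757": 5},
    zone_length=150, priority=["DC-10", "757"])
```

With these made-up numbers the sketch parks two DC-10s and two 757s in a string of total length 138. Keying the surviving strings on end label and value is what keeps the number of retained strings small at each stage.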

If there are $U_S$ strings remaining following the
elimination at stage S, the work involved in comparing and
eliminating pairs at stage S + 1 is at most $U_S N$ (recall
that there are N maneuver envelopes worthy of
consideration: the other M - N account for string start
geometry). Normally, the number of different string values
to consider depends upon the number of at most two airplane
types in a string. Thus, $U_1 \le N$ for a one-rectangle string
$X^1 = R_{i_0 i_1}$, and $U_2 \le N \cdot [(2 + 1)(2 + 2)/2]$ since there
are at most [(2 + 1)(2 + 2)/2] sums of two values having a
maximum of 2 occurrences between them (either airplane type
may occur 0, 1, or 2 times) in a string of length two.
Similarly, there are [(S + 1)(S + 2)/2] possible sums of two
values having a maximum of S occurrences between them in a
string of length S. Thus, allowing for N end-labels
(maneuver envelope indices at the end rectangle) and
[(S + 1)(S + 2)/2] possible string values for each
end-label, the number of strings that need be maintained at
stage S is of order $O(N S^2/2)$.
If the number of stages necessary to eliminate all but the
highest-payoff solution strings is P (at which point no
more interaction rectangles will fit within the loading
zone), then the total number of comparisons is
upper-bounded by $\sum_{S=2}^{P} O\!\left(N^2 (S + 1)^2/2\right)$, a
polynomial in N and P. Assume that N is fixed for a wide
range of optimization scenarios studied following a
geometric analysis, and note that P varies with loading
zone length. This implies that the complexity of the
optimization algorithm is a low-order polynomial in the
linear extent of the loading zones involved in the
scenarios.
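The count of possible string values used in this argument can be checked with a few lines of Python. The sketch assumes, as the text does, that only two airplane types contribute to a string's value; the function names are hypothetical.

```python
def value_combinations(stage):
    """Count pairs (a, b) of type counts with a + b <= stage."""
    return sum(1 for a in range(stage + 1)
               for b in range(stage + 1 - a))


def closed_form(stage):
    """The bracketed expression from the text: (S + 1)(S + 2) / 2."""
    return (stage + 1) * (stage + 2) // 2
```

For a string of length two both functions give 6, matching the [(2 + 1)(2 + 2)/2] figure quoted above.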

Figure 1. Airport terminal loading zones.

Figure 2. Geometry of taxi lane and loading zone in parking.

Figure 4. A loading zone formed from overlapping bands and corner slots.

Figure 5. A loading zone following a curving terminal wall.

Figure 6. An optimum solution for two airplane types (large and small pentagons)
in a loading zone at La Guardia Airport.

Figure 7. Two interaction rectangles created from maneuver envelopes A, B, C, and D.

Figure 8. Two interaction rectangles joined to form part of a string.
Note the matching of labels for A.

Abstract

Pollen from trees and other plants is preserved annually
in the sediments of bogs, ponds, and lakes, forming a
record of contemporary vegetation. Geologists sampling
pollen preserved over the past 20,000 years have created
large databases whose variables include latitude,
longitude, time, abundance, and pollen type. Traditional
methods of visualizing these data include pollen diagrams
(time series of plant taxa abundances), classification
diagrams, and contour maps of pollen abundance. While these
are valuable tools, modern computer graphics technologies
offer new ways of visualizing continuous multidimensional
structures. In this paper, we describe the development of
an interactive computer graphics package, supported by
IRIS-4D workstations, for visualization of
higher-dimensional aspects of the pollen database covering
eastern North America.

Introduction 

New methods of viewing and analyzing data in a given field
often lead to theoretical insights. Such insights have
arisen in late Quaternary palynology, the study of pollen
preserved in sediments dating to 18,000 years ago or
earlier. An excellent introduction to existing methods in
this area is Birks and Gordon (1985).

Pollen is deposited annually, accumulating with sediments
at sites such as bogs, ponds, and lakes which are favorable
to its preservation. The raw data of palynology come from
vertical cores or sections of such sediment, sampled at
different levels. Under microscopic examination,
individual pollen grains in a sediment sample are counted
and classified taxonomically.

Qualitative  descriptions  of  pollen  preserved  in  peat 
bogs  date  from  the  latter  part  of  the  nineteenth  century 
(Manten,  1967).  Quantitative  work  begins  with  Lennart 
von  Post  (1918,  in  Swedish;  reprinted  in  English  in  1967). 
Von  Post  recorded  the  percent  relative  frequency,  rather 
than  the  absolute  frequency,  of  each  taxon  of  interest 
in  the  sample.  Plotting  these  percentages  versus  sample 
depth,  he  collected  the  graphs  for  all  the  taxa  to  form  a 
pollen  diagram  for  the  site. 

*Present address: Department of Mathematics, Occidental
College, Los Angeles, CA 90041.


Pollen diagrams of percentage data continue to be the
fundamental form of data representation in palynology.
Through radioactive isotope dating, stratigraphic
comparisons, and other methods, depth can be mapped to time
with a resolution of from 100 to 250 years in most cases
(Webb, 1982). Pollen collecting at a site originates with
plants in the surrounding sampling basin, which varies with
taxon, site, and time (Birks and Birks, 1980). Relative
rather than absolute frequency is generally the preferred
measure of pollen abundance due to robustness (Mather 1972,
1980; Davis et al., 1973), despite the negative correlation
it introduces between pollen types in a sample. Thus, as
Von Post realized, the pollen diagram of a site
approximates the vegetation history of the surrounding
region. Pollen diagrams have become important sources of
data for paleoecological and paleoclimatological studies of
the late Quaternary (Birks and Birks, 1980; Webb et al.,
1987).
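The percent relative frequency measure, and the negative correlation it introduces between pollen types, can be illustrated with a minimal Python sketch. The taxon names and grain counts below are invented for illustration and are not taken from the database; the function name is hypothetical.

```python
def pollen_percentages(counts):
    """Convert grain counts per taxon to percent relative frequency."""
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Invented counts for one sediment sample.
sample = {"Picea": 120, "Quercus": 60, "Pinus": 20}
before = pollen_percentages(sample)

# Doubling one taxon's count lowers every other taxon's percentage,
# even though their absolute counts are unchanged.
after = pollen_percentages({**sample, "Picea": 240})
```

Here the Quercus percentage falls from 30% to 18.75% when only the Picea count changes, which is the closure effect behind the negative correlation mentioned above.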

Pollen data may also be viewed as points in an abstract
pollen space whose coordinate variables are latitude,
longitude, time, abundance, and pollen type. Since the
1970's, computer-based techniques for exploratory
multivariate analysis have provided new ways of visualizing
pollen data. Classification methods (scaling methods,
cluster analysis) based on 'dissimilarity' measures have
been the most popular (Gordon, 1981). These methods
generate a variety of diagrams revealing specific geometric
relationships between point data. An important application
has been finding modern analogues to fossil pollen
assemblages (Overpeck et al., 1985).

The spatial resolution of pollen data depends on both the
density of core sites and the size of the sampling basins
for each pollen type and site. Different processes
affecting vegetation are also evident at different scales.
By interpolating the data to a grid of the appropriate
scales in space and time, vegetation patterns generated by
processes acting at those scales emerge (Prentice, 1988).
For given times and taxa, maps of isopolls (contours of
pollen percentages) can be used to visualize these
patterns. Introduced by Szafer (1935), isopoll maps have
been used extensively since the 1970's. They have been
especially useful at the subcontinental scale, revealing
the influence of climate on vegetation (Webb, 1988).

Mapping methods regard data for a given pollen type as
discrete samples of structures which are essentially
continuous. This perspective can be extended to higher
dimensions. For example, isopoll maps are temporal cross
sections through a 3D space-time box in which isopolls are
2D surfaces (Webb, 1988). Selected plots of isopoll
surfaces were first generated in 1984 by Stead and Webb
(see Banchoff, 1990, pp. 82-83). Advances in computer and
exploratory graphics since that time now make possible a
much richer range of methods along these lines. In this
paper, we report our work on interactive graphics programs
for visualizing continuous structures in pollen space. Our
focus is on understanding both the 4D character of a given
pollen type and interactions between pollen types.

Data 

We worked with twelve taxa from a late Quaternary pollen
database for eastern North America maintained at Brown
University. The data had been smoothed and interpolated to
a space-time grid with increments of 1000 years in time and
about 100 km in space. The number of original sites
increases from fewer than 100 between 18,000 and 12,000
years ago to roughly 300 from 10,000 years ago to the
present.

Some auxiliary data were also provided. Points of the grid
covered by the receding continental ice sheet of the last
ice age were indicated. Continuous coastlines of the
present and 18,000 years ago were included for geographical
orientation. The paleocoastline was digitized using GSMAP
(Selner and Taylor, 1989). Its northern part was traced
from Dyke and Prest (1987). Since sea level rose as the ice
sheet retreated, the southern part could be approximated by
tracing modern bathymetric maps. The modern coastline came
from a database supplied with GSMAP.

All  geographical  data  had  been  projected  onto  the 
plane  using  an  Albers  equal  area  projection  (see  Snyder, 
1987,  p.  383).  Further  details  concerning  this  database 
may  be  found  in  Webb  (1988). 

Design 

Software design is a compromise between functional goals
and technological constraints. While broad goals were clear
initially, details were revised continuously during
development. Close collaboration between users and
developers contributed greatly to the success of this
project.

The central problem was visualizing continuous objects in
4-space. We chose to solve this by linking 3D slices of
these objects. This determined the basic visualization
structure, a 3D box containing contour surfaces.


The three coordinate axes for each of the boxes are
selected from the four continuous variables in the dataset:
latitude, longitude, time, and abundance. Four distinct
boxes of this sort are possible. The variable not chosen as
a coordinate determines the contour surface. For example,
if a value of 40% is chosen for abundance in the space-time
box, the isopoll surface enclosing grid points with an
abundance value >40% is displayed. Since abundance is a
function of the other variables, contour surfaces in boxes
with abundance as a coordinate are also function graphs.
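The box construction can be sketched in Python as follows. The sketch enumerates the four possible boxes and marks which cells of a gridded abundance array lie inside an isopoll surface; the nested-list grid layout and the function names are assumptions made for illustration and do not describe the data structures of the software reported here.

```python
from itertools import combinations

VARIABLES = ("latitude", "longitude", "time", "abundance")


def boxes():
    """Pair each choice of three axes with the implicit fourth variable."""
    return [(axes, next(v for v in VARIABLES if v not in axes))
            for axes in combinations(VARIABLES, 3)]


def isopoll_mask(grid, level):
    """Mark cells of grid[time][row][col] whose abundance exceeds level."""
    return [[[cell > level for cell in row] for row in plane]
            for plane in grid]


# A tiny invented grid: two time slices of a 2 x 2 spatial grid (percent).
grid = [[[10.0, 45.0], [50.0, 5.0]],
        [[42.0, 39.0], [0.0, 60.0]]]
mask = isopoll_mask(grid, 40.0)
```

The isopoll surface itself is the boundary of the marked region; extracting and rendering that boundary is a separate step not shown here.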

Each type of box is customized to enhance interpretation.
Axes are labeled, and maps are drawn on the appropriate
faces if the box has both space coordinates. In the
space-time box, the ice sheet may be displayed as either a
surface or point cloud.

The  exploratory  graphical  principles  of  linking  and 
focusing  guided  our  work  (see  Stuetzle  and  Buja,  1990). 
In  visualizing  large  multivariate  data  sets,  focusing  refers 
to  the  selection  of  views  or  subsets  of  data.  Linking 
involves  visually  relating  different  views  or  subsets. 

Focusing is accomplished here through choosing the boxes,
pollen types, and surfaces to be displayed. Multiple
selections may be displayed simultaneously. Surfaces in
3-space cannot be visually comprehended from a single 2D
view, making the dynamic capabilities of interactive
graphics essential. A box may be rotated in 3D
automatically or under direct control. One can also zoom in
on specific portions of a box.

Linking is used for data comparisons. Within a given box,
multiple surfaces can be selected and displayed
simultaneously. Wireframe or Gouraud shaded solid
representations in a variety of colors and simulated
materials can be selected for each surface, facilitating
discrimination between multiple surfaces. Wireframes can
also be overlaid on solid surfaces to create a textured
appearance. For a given pollen type and 3D box, the
implicit variable can be stepped through a range of values,
generating an animated sequence of contour surfaces.

Different  boxes  can  also  be  linked.  For  example,  a 
specific  time  can  be  highlighted  in  the  space-time  box, 
generating  the  corresponding  surface  in  the  space-abun¬ 
dance  box.  If  an  isopoll  sequence  is  then  animated  in  the 
space-time  box,  a  sequence  of  highlighted  cross  sections 
at  corresponding  abundance  values  will  be  animated  on 
the  surface  in  the  space-abundance  box. 

Users  control  the  programs  interactively  via  pop-up 
menus  and  appropriate  mouse  or  keyboard  input.  There 
are  several  recording  options.  A  given  image  may  be 
printed or stored in a snapshot. Entire sessions can also
be  recorded  for  later  playback. 

Implementation

This  software  was  developed  for  the  Silicon  Graphics 
IRIS-4D  series  of  workstations  with  version  3.3  of  the 
operating  system.  Of  the  platforms  available  to  us,  this 
one  offered  the  powerful  interactive  real-time  graphics  we 
required.  The  code  was  written  in  C  using  the  graph¬ 
ics  library  and  window  manager  documented  in  Silicon 
Graphics,  Inc.  (1990a,  1990b,  1990c).  It  is  highly  mod¬ 
ular  to  facilitate  continued  development  and  adaptation 
to  similar  data  sets. 

In this environment, it is easiest to run each 3D box
as  a  separate  program.  Because  of  customization,  sev¬ 
eral  different  programs  are  required.  Pollen  displays  the 
space-time  box,  while  Slice  runs  the  space-abundance 
box.  The  boxes  with  time,  abundance,  and  latitude  or 
longitude, respectively, as coordinates are separate options
in Pathview.

Most  implementation  is  standard,  using  the  rich 
function  libraries  provided  by  Silicon  Graphics.  It  was 
necessary,  however,  to  find  a  way  of  polygonalizing  con¬ 
tour  surfaces.  A  script  language  was  also  developed  to 
simultaneously  solve  the  two  problems  of  recording  snap¬ 
shots  or  sessions  and  linking  boxes. 

For contour surface polygonalization, we used a variant
of the marching cubes algorithm (Lorensen and Cline,
1987). A contour value partitions vertices in a 3D data
grid into two sets. Contour surfaces separate these sets.
Since membership for each vertex can be determined
independently, suitably constrained contour surfaces can
be constructed locally. Lorensen and Cline construct
triangulated surfaces constrained to intersect edge
midpoints. Favoring reduced computation at the expense of
a somewhat rougher surface, we opted instead to construct
triangulated surfaces constrained to intersect vertices.
Each data cube has 8 vertices, so there are 2^8 =
256 distinct ways such surfaces can intersect a cube. For
each cube, a hash look-up table is used to determine
what surfaces to draw.
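
The table look-up described above can be illustrated with a short Python sketch. It is not the authors' C code: the function name, the corner ordering and the sample grid are assumptions made for illustration. It shows only how the in/out status of a cube's 8 vertices packs into one of 2^8 = 256 case numbers, which would then index a table of surface patches.

```python
from itertools import product

def cube_case_index(grid, i, j, k, level):
    """Return an 8-bit index (0..255) recording which of the 8 vertices
    of the cube at (i, j, k) have a value above the contour level."""
    index = 0
    # Visit the 8 corners in a fixed order; corner number b sets bit b.
    for bit, (di, dj, dk) in enumerate(product((0, 1), repeat=3)):
        if grid[i + di][j + dj][k + dk] > level:
            index |= 1 << bit
    return index

# A 2 x 2 x 2 grid is a single cube; the values are hypothetical abundances (%).
grid = [[[10, 50], [20, 60]],
        [[30, 45], [35, 70]]]
case = cube_case_index(grid, 0, 0, 0, level=40)
print(case)  # key into a 256-entry table of surface patches
```

Because each vertex is classified independently, neighbouring cubes can be processed in any order.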

A script language proved to be a flexible solution
to  recording  and  communication  problems.  The  lan¬ 
guage  contains  all  the  possible  user  commands  along 
with  higher  level  programming  constructs.  In  snapshot 
mode,  a  file  of  script  commands  necessary  to  reconstruct 
the  current  view  is  generated.  While  in  record  mode,  all 
commands  executed  by  the  user  are  written  to  a  script 
file  which  can  be  replayed  later.  The  same  language  is 
used  to  exchange  commands  linking  two  programs. 
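
The distinction between record mode and snapshot mode can be sketched as follows. This is a hypothetical illustration in Python, not the script language of the original software: the command names (`contour`, `rotate`) and the rule that each command simply sets one attribute of the view are assumptions.

```python
def apply_command(state, command, args):
    # Hypothetical command set: each command simply sets one view attribute.
    state[command] = tuple(str(a) for a in args)

class Session:
    """Execute commands on a view state, logging each one in script form."""

    def __init__(self):
        self.state = {}
        self.script = []  # record mode: one line per executed command

    def execute(self, command, *args):
        apply_command(self.state, command, args)
        self.script.append(" ".join([command, *map(str, args)]))

    def snapshot(self):
        # Snapshot mode: only the commands needed to rebuild the current view.
        return [" ".join([cmd, *args]) for cmd, args in self.state.items()]

def replay(script):
    """Rebuild a view state by re-executing script lines."""
    state = {}
    for line in script:
        command, *args = line.split()
        apply_command(state, command, args)
    return state

session = Session()
session.execute("contour", 40)
session.execute("rotate", 30, 0, 0)
session.execute("contour", 55)
```

Here the recorded session holds three lines while the snapshot holds two, yet both replay to the same view. The same lines could equally be sent to a second program to link two boxes.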


Discussion 

More  experience  is  needed  to  determine  how  best  to  use 
features  of  the  existing  software.  We  have  found,  for 
example,  that  the  choice  of  surface  attributes  is  criti¬ 
cal  when  viewing  multiple  objects  in  the  same  3D  box. 
Interpreting  the  results  of  linking  boxes  will  also  take 
practice.  We  need  systematic  ways  of  using  these  tools 
to  develop  higher-dimensional  intuition. 

Refinements  and  extensions  are  planned.  These  in¬ 
clude  improvements  in  contour  surface  polygonalization, 
interbox  communication,  and  the  user  interface.  The 
most important extension is to Pathview. The user will
be  able  to  specify  a  curvilinear  geographical  transect 
which  will  take  the  place  of  latitude  or  longitude.  In 
this  way,  it  will  be  possible  to  visualize  the  effects  of 
specific  geographical  features. 

While  development  continues,  the  current  software 
is  beginning  to  be  used  for  both  research  and  teaching. 
It  has  been  found  that  certain  contour  values  for  each 
taxon  yield  distinctive  surface  forms.  In  some  cases, 
these  neatly  summarize  facts  which  had  already  been 
gleaned  from  map  sequences,  but  new  features  are  also 
emerging.  The  interesting  geometry  observed  thus  far  is 
also  motivating  the  development  of  mathematical  mod¬ 
els  which  capture  this  structure. 

Although  some  of  the  features  of  this  software  are 
specific  to  this  particular  pollen  dataset,  only  slight  mod¬ 
ifications  would  be  required  to  visualize  other  pollen 
data.  If  more  extensive  changes  were  made,  especially 
to  the  user  interface,  other  multivariate  data  could  be 
used. Future applications include visualization of relationships
between pollen and climatological data.

Abstract

The  standard  deviation  (or  equivalently  the  variance)  of  a 
sample  of  numbers  is  one  of  the  most  elementary  concepts 
in  statistics.  Yet  this  computation  harbours  a  number  of 
serious  difficulties,  especially  when  the  sample  is  large  and 
the  standard  deviation  is  small  relative  to  the  mean. 

This  contribution  will  describe  prototype  software  for 
both  didactic  and  production  use  to  allow  reliable  calculation 
of  sample  variances  (or  equivalently  standard  deviations), 
for  a  wide  variety  of  sample  sizes  and  data  characteristics. 
Several  illustrations  of  the  software  and  its  evaluation  will 
be  presented,  if  appropriate  accompanied  by  a  live 
demonstration. 

1  Introduction 

Computation  of  the  sample  variance,  or  equivalently  the 
sample  standard  deviation,  is  one  of  the  most  common  and 
fundamental  tasks  in  statistical  computation.  Indeed  it  is  so 
common  that  the  difficulties  it  may  present  are  often 
overlooked.  There  is  a  fairly  rich  literature  on  these 
difficulties  and  ways  to  overcome  them.  (Almost  all  the 
citations  at  the  end  of  this  paper  concern  this  topic,  and 
specific  references  will  be  placed  in  the  body  of  the  paper.) 
Nevertheless, the issue continues to give concern (see, for
example, Smith, 1991, for a description of multiple
complaints  with  the  standard  deviation  function  @STD  in 
different  versions  of  the  popular  spreadsheet  program  Lotus 
1-2-3). 

The  defining  formula  for  the  variance  (the  adjective 
"sample”  will  be  dropped  where  the  meaning  is  clear)  also 
provides  a  computational  algorithm.  Using  symbols  which 
are  suitable  for  incorporation  into  a  computer  program,  we 
first  calculate  the  (sample)  mean  as 

(1)  X_bar = {SUM i := 1..n : X(i)} / n

then  use  this  information  in  a  second  pass  through  the  data 
to  calculate  the  variance 


(2)  V(X) = {SUM i := 1..n : [X(i) - X_bar]^2} / (n - 1)

and hence the standard deviation is computed as

(3)  SD(X)  =  sqrt(V(X)) 
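
For concreteness, the two-pass defining algorithm of equations (1) to (3) can be written as the following Python sketch. The function names and the sample data are illustrative assumptions; this is not the code of the program described later in the paper.

```python
import math

def two_pass_variance(x):
    """Sample variance by the defining formula: two passes over the data."""
    n = len(x)
    x_bar = sum(x) / n                      # equation (1): first pass
    s = sum((xi - x_bar) ** 2 for xi in x)  # second pass: squared deviations
    return s / (n - 1)                      # equation (2)

def standard_deviation(x):
    return math.sqrt(two_pass_variance(x))  # equation (3)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
variance = two_pass_variance(data)
```

For this data the mean is 5 and the sum of squared deviations is 32, so the variance is 32/7.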

The  issues  upon  which  this  contribution  is  based  are: 

1)  Accuracy  --  How  large  is  the  deviation,  either  absolute 
or  relative,  between  the  computed  variance  and  the  "true” 
value  which  would  be  obtained  if  we  made  no  error  in 
computation?  This  is  especially  important  when 

SD(X) << X_bar (Chan and Lewis, 1978).

2)  Efficiency  --  How  fast  is  our  calculation?  How  many 
basic  arithmetic  operations  are  we  required  to  perform,  and 
is  this  in  some  way  optimal?  In  particular,  we  would  like  to 
avoid  two  passes  through  the  data  if  the  data  set  is  large. 

3)  Complexity  --  Are  the  program  code  and  data  structures 
simple  and  straightforward,  or  do  we  need  complicated 
programs  which  require  very  careful  attention  to  many 
details?  Can  we  exploit  parallel  computational  facilities,  or 
partition  the  calculation  so  that  regional  offices  can  partly 
carry  out  the  calculations? 

4)  Education -- How can the concerns and mechanisms for
responding  to  them  be  made  available  to  others?  How  can 
a  greater  awareness  of  the  difficulties  be  achieved? 

The  work  reported  here  is  part  of  a  long-standing  and 
ongoing  project  to  address  these  issues.  The  present 
contribution  is  directed  primarily  toward  providing  a 
prototype  computer  program  which  illustrates  most  of  the 
algorithms  which  have  been  proposed  to  compute  the  sample 
mean and sample variance. Some ideas are presented on the
preparation of "production" codes, and design elements of a
program to prepare specially formatted test data sets are
discussed briefly.

2  Foundations 

For  convenience  we  will  define  the  quantities 

(4)  T = {SUM i := 1..n : X(i)}  and

(5)  S = {SUM i := 1..n : [X(i) - X_bar]^2}

This  mirrors  Chan,  Golub  and  LeVeque  (1983),  who  give 
a  decision  table  for  selecting  an  appropriate  algorithm.  The 
present  work  could  be  viewed  in  part  as  providing  an 
illustration,  within  a  single  program,  of  the  ideas  contained 
within  this  decision  table.  They  also  include  a  survey  of 
various  error  analyses  which  have  been  carried  out  for  the 
different  algorithms  used  to  compute  S  or  V(X);  readers 
may  wish  to  note  that  there  are  some  minor  differences  in 
detail  between  the  formulas  which  have  been  published  in 
different  reports.  For  consistency,  we  have  computed  error 
measures  in  our  program(s)  based  on  the  formulas  of  Chan 
et  al.  (1983). 

From  a  didactic  point  of  view,  error  analyses  can  be  dry 
and  tedious  enough  that  the  important  messages  they  carry 
may  be  overlooked.  In  the  present  application,  some  of 
these  messages  are: 

-  that  the  desire  to  overcome  two  passes  through  the 
data  may  tempt  users  to  employ  calculation  methods  which 
throw  away  information  which  is  present  in  the  data,  such 
as  the  popular  but  dangerous  "Textbook"  algorithm.  This  is 
based  on  the  algebraically  equivalent  forms  for  S 

(6)  S = {SUM i := 1..n : X(i)^2} - n * X_bar^2

       = {SUM i := 1..n : X(i)^2} - (T^2) / n

-  that  loss  of  information  frequently  occurs  because  the 
difference  of  two  (often  large)  nearly  equal  numbers  is 
calculated,  causing  digit  cancellation.  The  Textbook 
algorithm  can  be  seen  to  encourage  such  a  subtraction, 
especially  when  the  data  elements  X(i)  are  of  a  comparable 
magnitude. 

-  that a large n (large data sets) may cause inaccuracies
in the computation of T, or equivalently the mean X_bar,
if  the  accumulation  is  performed  by  adding  the  data 
elements  one  at  a  time  into  an  accumulator.  Similar 
difficulties  may  occur  when  updating  methods  are  used  to 
calculate  S.  This  motivates  the  use  of  pairwise  summation. 
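
The cancellation described above is easy to reproduce in ordinary double precision. The following Python sketch is an illustration only: the data values and function names are assumptions. It compares the Textbook form (6) with the two-pass form (5) on four numbers whose mean is large relative to their spread, and includes a simple recursive pairwise summation.

```python
def textbook_s(x):
    """Sum of squared deviations by the Textbook form, equation (6)."""
    n = len(x)
    t = sum(x)
    return sum(xi * xi for xi in x) - t * t / n

def two_pass_s(x):
    """Sum of squared deviations by the defining form, equation (5)."""
    x_bar = sum(x) / len(x)
    return sum((xi - x_bar) ** 2 for xi in x)

def pairwise_sum(x):
    """Sum by recursive halving instead of a single running accumulator."""
    if len(x) <= 2:
        return sum(x)
    mid = len(x) // 2
    return pairwise_sum(x[:mid]) + pairwise_sum(x[mid:])

# Deviations from the mean are -6, -3, 3, 6, so the exact S is 90.
data = [1.0e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
print(two_pass_s(data), textbook_s(data))
```

The squares are near 10^18, where adjacent double-precision numbers are more than 100 apart, so the Textbook difference cannot represent the answer 90. The two-pass form works with the small deviations directly and recovers it.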

3  The  computer  program  VARIANCE 

The  working  prototype  program  VARIANCE  has  been 
developed and was demonstrated at Interface '91. The goals
of  its  development  are: 

1)  to show how the various approaches to variance
calculation work and

2)  to allow different methods to be applied to different
test data sets.

Furthermore,  VARIANCE  is  designed  with  a  unified 
program  structure  so  that  all  variations  are  included  within 
a  single  program,  with  no  extra  code  to  include  or  remove. 


VARIANCE  has  the  following  features: 

a)  It  has  been  programmed  in  Borland’s  Turbo  Pascal, 
version  5.0,  under  the  MS-DOS  operating  system.  The 
commented,  and  hopefully  readable,  source  code  occupies 
47K  bytes  and  the  executable  form  39K.  The  author  intends 
to  distribute  it  as  share-ware  or  at  nominal  cost  as  a  self¬ 
teaching  or  classroom  demonstration,  which  should  interest 
a  wide  audience.  Extended  precision  accumulation  has  been 
avoided  to  maintain  some  semblance  of  conformity  to  other 
variants  of  Pascal. 

b)  VARIANCE  currently  includes  5  main  algorithms 

-  the  Textbook  algorithm 

-  the  standard  Two-Pass  defining  algorithm,  with 
automatic  calculation  of  the  Bjorck  (Chan  et  al. 
1979)  correction  terms 

-  West’s  (1979)  updating  method 

-  the  Pairwise  algorithm  of  Chan  et  al.  (1979) 

-  Cotton’s  (1975)  updating  method 

c)  Where  appropriate,  summations  may  optionally  be 
performed  in  a  pairwise  manner. 

d)  The  program  and  the  data  structure  described  below 
allow  a  set  of  data  to  be  partitioned  into  blocks.  This 
reflects  the  possibility  of  computation  in  parallel  by 
multi-processor  computing  systems.  Alternatively,  we  may 
think  of  data  collected  and  partially  processed  by  separate 
agents.  The  results  for  separate  blocks  may  be  combined  by 
direct  summation  or  by  an  extension  of  the  pairwise 
updating  formula  of  Chan  et  al.  (1979)  discussed  below. 

e)  All  data  may  be  shifted  (or  coded),  that  is,  a  constant 
may  be  subtracted  from  each  data  element  within  a  block  of 
data.  The  strategies  allowed  for  shifting  are: 

-  No  shift 

-  Fixed  shift  (for  all  blocks)  entered  by  the  user 

-  Sample  the  first  data  block,  with  a  user-supplied 
sample  size,  and  use  the  mean  of  the  sample  as  a 
(fixed)  shift  for  all  blocks 

-  Sample  each  block  of  data,  with  a  user-supplied 
sample  size,  and  use  the  mean  of  the  sample  for 
each  block  as  the  shift  for  that  block. 

f)  Operation  counts  of  real  and  integer  arithmetic, 
assignment  (storage)  operations,  and  control  decisions  (IF  or 
WHILE  or  CASE  or  UNTIL)  are  recorded  and  displayed  at 
various  points  in  the  program. 

g)  A  number  of  options  for  control  and  information 
display  are  included  such  as: 

-  interactive  or  batch  operation 

-  treat  a  data  set  as  a  single  block 

-  pause  for  user  response  after  each  data  element 

-  display  intermediate  results 

h)  error  bounds  from  the  error  analyses  reported  by 
Chan  et  al.  (1983)  and  others  are  reported  with  the 
computed  mean  and  variance 

i)  control information may be obtained from a file for
"hands off" operation

j)  results are optionally saved in a file which is in a
form  suitable  for  use  as  an  input  file  above 

k)  execution  is  timed. 
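
As an illustration of the updating methods listed under (b), the following Python sketch uses a one-pass recurrence of the kind usually attributed to West (1979) and others. Whether it matches the variant coded in VARIANCE in every detail is an assumption, and the function name is hypothetical.

```python
def updating_mean_and_s(data):
    """One pass over the data, updating the mean and the sum of squared
    deviations S as each element arrives."""
    mean, s = 0.0, 0.0
    for k, x in enumerate(data, start=1):
        delta = x - mean                  # deviation from the previous mean
        mean += delta / k                 # mean of the first k elements
        s += (k - 1) * delta * delta / k  # S for the first k elements
    return mean, s

base = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, s = updating_mean_and_s(base)

# The same data moved far from zero: S should be unchanged.
mean_far, s_far = updating_mean_and_s([1.0e9 + x for x in base])
```

Only deviations from the running mean are squared, so a large mean does not provoke the subtraction of two large, nearly equal numbers.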

4  Trial  data  sets 

The  program  VARIANCE  takes  as  input  data  a  simple  text 
file  having  the  following  structure: 

a)  Comment  lines  may  appear  anywhere  in  the  data  set 
and  begin  with  a  special  character  in  the  first  position  of  the 
line. Currently the exclamation mark (!) is used as the
comment character. We feel it is important that the number
and position of comments not be restricted so that full
documentation of data sets may be provided. For example,
comment lines may contain "exact" results, or give timing
or  control  information  for  specific  computing  platforms. 

b)  Numerical  data  is  provided  in  text  form,  currently  1 
number  per  line  for  simplicity. 

c)  Blocks  (there  must  always  be  at  least  1)  are  ended 
with  a  line  consisting  of  the  word  ENDBLOCK 

d)  The data file is ended with a line consisting of the
word ENDDATA. Strictly this line is not needed, but it
documents the end of the data explicitly.
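
A reader for this file structure might look like the following Python sketch. It is not the Turbo Pascal input routine of VARIANCE. The function name is hypothetical, and skipping blank lines and tolerating a missing final ENDBLOCK are assumptions that go beyond the format as described.

```python
def read_variance_file(lines, comment_char="!"):
    """Split the lines of a data file into blocks of numbers."""
    blocks, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        # Comments are marked in the first position of the line.
        if line.startswith(comment_char) or not line.strip():
            continue
        word = line.strip()
        if word == "ENDBLOCK":
            blocks.append(current)
            current = []
        elif word == "ENDDATA":
            break
        else:
            current.append(float(word))  # one number per line
    if current:  # assumption: accept data not closed by ENDBLOCK
        blocks.append(current)
    return blocks

sample = ["! demonstration file", "1.5", "2.5", "ENDBLOCK",
          "! exact mean of block 2 is 4", "4", "ENDBLOCK", "ENDDATA"]
blocks = read_variance_file(sample)
```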

An  outline  of  a  program  to  generate  test  data, 
VARDATA,  has  been  prepared.  The  principles  behind  this 
program  are  that  it 

a)  build  data  files  in  the  format  described  above 

b)  be  easily  extended  as  new  requirements  are  stated 

c)  use  both  fixed  and  pseudo-random  series,  and 
different  distributions  for  pseudo-random  series 

d)  allow  different  scalings  and  shifts  to  be  applied,  in 
particular  to  force  roundings  so  that  machine  internal 
representations  of  data  are  inexact 

e)  provide  exact  results  information  in  the  form  of 
comments  in  the  data  sets. 
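
Principles (a), (d) and (e) can be illustrated by a small generator that writes integer data in the block format and records exact results as comments. This Python sketch is hypothetical and is not the VARDATA outline itself. It is restricted to integer values, scalings and shifts so that exact rational arithmetic is straightforward, and it does not attempt the forced roundings mentioned in (d).

```python
from fractions import Fraction

def make_data_file(blocks, scale=1, shift=0):
    """Return the lines of a data file for integer blocks, with the exact
    mean and variance of the scaled and shifted data given as comments."""
    values = [scale * v + shift for block in blocks for v in block]
    n = len(values)
    mean = Fraction(sum(values), n)
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    lines = [f"! exact mean = {mean}", f"! exact variance = {variance}"]
    for block in blocks:
        lines.extend(str(scale * v + shift) for v in block)
        lines.append("ENDBLOCK")
    lines.append("ENDDATA")
    return lines

lines = make_data_file([[2, 4, 4, 4], [5, 5, 7, 9]], shift=1000)
```

The shift moves the exact mean from 5 to 1005 and leaves the exact variance at 32/7, which gives a simple check on any algorithm under test.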

We  believe  that  there  is  a  need  for  several  classes  of 
data  sets:  1)  didactic  sets  to  illustrate  the  difficulties  and 
peculiarities  of  variance  calculation  methods;  2)  test  data 
sets  which  permit  relatively  rapid  validation  of  codes  and 
checking  of  details  of  implementations;  3)  very  large  test 
data  sets  so  that  production  codes  can  be  exercised  and 
timed.  The  last  class  may  best  be  generated  when  needed  so 
long  as  the  generation  process  is  reliable. 

5  A  generalized  combining  formula 

Suppose  we  have  two  sets  of  data  to  which  the  following 
information  applies: 


                            Set a     Set b

Number of elements          n_a       n_b
Shift used                  k_a       k_b
Sum of data from shift      sT_a      sT_b
Sum of deviations^2         S_a       S_b

(Note that the shift is irrelevant to the sums of squared
deviations.) The sum of data from the shift is, for set a,

sT_a = {SUM i := 1..n_a : X_a(i) - k_a}

We  now  wish  to  compute  the  combined,  unshifted 
values  for  n,  T  and  S.  Clearly 

(7)  sT_a = T_a - n_a * k_a

(8)  T = T_a + T_b = sT_a + sT_b + n_a k_a + n_b k_b

The  combining  formula  of  Chan  et  al.  (1979)  may  be  cast 
in  various  forms 

(9a)  S = S_a + S_b + Q_ab, where

(9b)  Q_ab = (n_b T_a - n_a T_b)^2 / (n_a n_b (n_a + n_b))

           = n_a n_b (T_a/n_a - T_b/n_b)^2 / (n_a + n_b)

           = (n_a n_b / (n_a + n_b)) (X_bar(a) - X_bar(b))^2

           = [(sT_a + n_a k_a)/n_a - (sT_b + n_b k_b)/n_b]^2 (n_a n_b / (n_a + n_b))

           = n_a n_b [sT_a/n_a - sT_b/n_b + (k_a - k_b)]^2 / (n_a + n_b)

The  last  two  forms  are  the  generalizations  for  shifting.  Since 
Chan  et  al.  developed  the  formulae  mainly  for  use  in  the 
pairwise  algorithm,  they  did  not  need  the  extension  to 
shifted data. Note that using a common shift for both blocks
allows  us  to  ignore  the  shifts.  However,  we  may  wish  not 
to  do  this.  As  far  as  the  author  is  aware,  a  full  error 
analysis  has  not  been  completed  for  the  pairwise  algorithm. 
In  the  program  VARIANCE,  the  bounds  conjectured  by 
Chan  et  al.  (1983)  have  been  used.  No  extra  analysis  for  the 
extended  formula  involving  shifts  has  been  incorporated  to 
date. 
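
The combining step can be checked numerically with a short Python sketch. The function names and the test data are assumptions for illustration. Each block is summarized by its count, shift, shifted sum and sum of squared deviations, and two summaries are merged using equation (8) and the last form of (9b).

```python
def block_summary(x, shift=0.0):
    """Return (n, k, sT, S) for one block, using a block-specific shift k."""
    n = len(x)
    st = sum(xi - shift for xi in x)       # sum of data measured from the shift
    mean = st / n + shift
    s = sum((xi - mean) ** 2 for xi in x)  # unaffected by the shift
    return n, shift, st, s

def combine(a, b):
    """Merge two block summaries into the unshifted n, T and S."""
    n_a, k_a, st_a, s_a = a
    n_b, k_b, st_b, s_b = b
    n = n_a + n_b
    t = st_a + st_b + n_a * k_a + n_b * k_b                         # equation (8)
    q_ab = n_a * n_b * (st_a / n_a - st_b / n_b + (k_a - k_b)) ** 2 / n  # (9b)
    return n, t, s_a + s_b + q_ab                                   # (9a)

a = block_summary([2.0, 4.0, 4.0, 4.0], shift=3.0)
b = block_summary([5.0, 5.0, 7.0, 9.0], shift=6.0)
n, t, s = combine(a, b)
```

For these two blocks S_a = 3, S_b = 11 and Q_ab = 18, so S = 32 and T = 40. These agree with a direct calculation on the eight values taken together.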

6  Production  codes 

The  didactic  program  VARIANCE  allows  a  user  to  examine 
the  properties  of  different  algorithms  acting  on  various  data 
sets.  This  is  useful  in  selecting  a  method  appropriate  to  a 
particular  class  of  data  sets  so  that  the  accuracy  of  the 
results  may  be  controlled.  The  operation  counts  (and  timing 
where  user  interaction  is  not  required)  suggest  the  relative 
efficiencies  of  algorithms,  but  do  not  offer  a  very  precise 
measure  of  the  timing  which  may  be  obtained  under 
real-world  operating  conditions,  where  it  is  likely  that  data 
retrieval  will  dominate  the  timing. 

Any production version of a mean and variance
calculation  program  requires  attention  to  the  details  of 

-  data  transfer  from  storage  to  the  calculation  program  and 
intermediate  storage  of  sample  data; 

-  efficient  coding  of  algorithms  in  the  chosen  programming 
language  —  we  do  not  believe  that  the  didactic  code  is 
necessarily  appropriate  without  modification; 

-  extended  length  arithmetic  for  accumulation  of  sums; 

-  placement  of  control  and  timing  functions  so  that  they 
interfere  as  little  as  possible  with  the  computations; 

-  handling  of  missing  data; 

-  handling  of  multiple  variables  at  one  time; 

-  computing  covariances  (and  hence  correlations). 
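
Several of these points can be seen together in a small Python sketch. It is an assumption-laden illustration rather than a recommended production code. Missing values are taken to be `None` or NaN and are handled by dropping incomplete pairs, and Python's `math.fsum` stands in for extended length accumulation.

```python
import math

def is_missing(v):
    return v is None or (isinstance(v, float) and math.isnan(v))

def mean_var_cov(xs, ys):
    """Means, variances and covariance of two variables, using only the
    observations where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not is_missing(x) and not is_missing(y)]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete observations")
    # math.fsum accumulates without intermediate rounding error.
    x_bar = math.fsum(x for x, _ in pairs) / n
    y_bar = math.fsum(y for _, y in pairs) / n
    s_xx = math.fsum((x - x_bar) ** 2 for x, _ in pairs)
    s_yy = math.fsum((y - y_bar) ** 2 for _, y in pairs)
    s_xy = math.fsum((x - x_bar) * (y - y_bar) for x, y in pairs)
    return {"n": n, "mean_x": x_bar, "mean_y": y_bar,
            "var_x": s_xx / (n - 1), "var_y": s_yy / (n - 1),
            "cov_xy": s_xy / (n - 1)}

result = mean_var_cov([1.0, 2.0, None, 4.0, 5.0],
                      [2.0, 4.0, 6.0, float("nan"), 10.0])
```

Other treatments of missing data are possible, and the choice would need to be documented in any real application.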

7  Ongoing Work

Despite  the  fundamental  nature  of  variance  computation,  a 
number  of  tasks  remain  to  be  completed.  First,  VARIANCE 
needs  more  thorough  validation,  improved  commentary  and 
documentation,  and  careful  adaptation  to  different  computing 
platforms.  At  Interface  ’91,  Rich  Heiberger  made  a  number 
of  useful  suggestions,  one  of  which  is  that  restricted  length 
arithmetic  would  be  helpful  in  demonstrating  the  failure  of 
the  Textbook  algorithm.  Second,  the  data  generation 
program  VARDATA  needs  flesh  on  the  skeleton.  Third, 
some  example  production  codes  and  applications  to 
real-world data should be prepared. Collaboration in such
development would be most welcome; indeed it is critical
to the third task, for which applications must be provided.
Interested parties should contact the author.

Abstract

KEYFINDER  is  a  menu-driven  Prolog  program  that  assists 
statisticians  in  the  difficult  task  of  generating  blocked  and/or 
fractional-replicate experimental designs in
highly-constrained  situations.  Designs  are  constructed  from 
sets  of  generators  called  “design  keys”.  A  depth-first  search 
algorithm  builds  keys  which  yield  designs  matching  detailed 
user specifications. Design parameters include the number of
experimental  units  and  the  numbers  of  levels  of  the  various 
block  and  treatment  factors.  Block  factors  may  be  combined 
into  row-and-column,  crossed  or  nested  (split-plot) 
arrangements.  The  user  can  also  specify  the  orders  of 
treatment  interactions  that  must  remain  (a)  unaliased  with 
treatment  main  effects  and  (b)  unconfounded  with  blocks; 
further  options  are  available  to  ensure  that  specific 
higher-order  interactions  of  interest  also  remain  estimable. 
Keys  are  used  to  generate  balanced  designs  in  which  all  the 
block  and  treatment  factors  have  numbers  of  levels  which  are 
powers  of  the  same  prime  number.  Direct-product  facilities 
allow  the  user  to  combine  keys  in  different  primes  and  thus 
produce  totally  asymmetrical  plans.  Procedures  are  provided 
for the correct randomization of all experimental plans
generated,  in  accordance  with  the  block  structure. 
KEYFINDER  is  implemented  on  IBM  PCs  and  PS/2s  and, 
more  effectively,  on  SUN  3  and  SUN  4  workstations. 
Executable  copies  of  Version  1  of  the  program  are  available 
on  request  from  the  author,  free  of  charge. 

1.  Introduction 

There  are  a  vast  number  of  computer  programs  currently  on 
the market for the statistical analysis of experimental data,
but  relatively  few  for  the  equally  important  area  of 
experimental  design.  Those  design  systems  that  have 
appeared in recent years have, in the main, been "expert
systems" targeted at non-statisticians. Examples include
CADEMO (Rasch et al., 1987), DESIGN-EASE™ and
DESIGN-EXPERT™ (STAT-EASE Inc., Minneapolis,
MN), DESIGN EXPERT (Williams, 1991),
EXPERIMENTAL DESIGN™ (Statistical Programs,
Houston, TX) and SELINA (Baines et al., 1986, 1988). Very
few programs are available to help the professional
statistician with those parts of experimental design
construction that are difficult or laborious. Only PROC
FACTEX and PROC OPTEX in SAS® (SAS Institute Inc.,
Cary, NC) readily spring to mind.

KEYFINDER (Zemroch, Lunn, Baines and Clithero, 1989;
Zemroch,  1990,  1991)  is  a  menu-driven  Prolog  program  for 
generating  blocked  and/or  fractional-replicate  experimental 
designs.  The  program  uses  general  algorithms,  not  stored 
catalogues,  to  produce  designs  so  that  plans  can  be 
constructed  in  arbitrary  and  quite  complex  situations. 
KEYFINDER’s  major  strength  is  its  ability  to  generate 
designs with user-defined confounding and aliasing patterns.
This  makes  the  program  an  invaluable  aid  to  statisticians 
needing  to  produce  designs  in  the  real  world  of 
highly-constrained  experimentation.  Details  of 
KEYFINDER’s  implementation  are  given  in  Section  2. 

Designs are constructed from sets of generators called
“design  keys”  (Patterson,  1965;  Patterson  and  Bailey,  1978); 
these  are  described  in  Section  3.  The  aliasing  and 
confounding  properties  of  a  design  are  readily  deduced  from 
its  key,  but  the  methodology  has  yet  to  win  widespread 
acceptance  because  of  the  difficulty  of  reversing  the 
deductive  process.  Writing  down  sets  of  generators  to  yield 
designs with predefined properties is a non-trivial task in the
general case. Indeed this is a search process which is ideally
suited  to  computerization.  KEYFINDER  finds  keys  matching 
detailed  user  specifications  using  a  depth-first  search 
algorithm;  an  outline  of  this  is  given  in  Section  4.  The 
algorithm is published in Zemroch et al. (1989) and is more
general than its predecessor, the earlier KEYGEN procedure
of  Zemroch  (1986,  1988),  and  the  pioneering  algorithm  of 
Franklin  (1985). 

KEYFINDER  uses  keys  to  generate  a  wide  range  of  designs 
with  a  variety  of  block  structures.  Direct  product  facilities 
allow  sets  of  keys  to  be  combined  together  giving  greater 
flexibility  in  design  dimensions.  Nevertheless  not  every 
useful design can be obtained from design keys and the direct
product  method  and  so  the  system  is  currently  being 
expanded  to  incorporate  other  design  classes.  Full  details  of 
the  designs  covered  are  given  in  Section  5. 

Sir  R.A.  Fisher  (1935)  first  realized  the  necessity  of 
randomizing experimental designs. This must be done in due
accordance with block structure. Section 6 describes the
randomization and sorting facilities provided in
KEYFINDER. The paper concludes with a discussion of
KEYFINDER’s performance in practice (Section 7).

2.  The  KEYFINDER  program 

The  KEYFINDER  program  has  facilities  for  (i)  the 
construction,  storage  and  retrieval  of  design  keys,  (ii)  the 
subsequent  generation  and  storage  of  the  associated 
experimental  plans,  (iii)  the  randomization  and  sorting  of 
experimental  plans*,  (iv)  the  construction  of  direct-product 
designs*,  and  (v)  the  execution  of  DOS  or  UNIX  system 
commands. 

KEYFINDER  is  written  in  Prolog,  Prolog  being  chosen 
because  of  its  pattern  matching  facilities  (via  unification), 
automatic  backtracking,  and  richness  of  representation  (lists, 
structures  and  a  built-in  database),  combined  with  its  ability 
to  generate  and  execute  code  (see  Clocksin  and  Mellish, 
1984, or Bratko, 1986). A menu system has recently been
provided to spare the user the tedium of providing sequences
of complex multi-parameter Prolog queries in order to
produce  designs.  The  user  simply  has  to  select  one  of  a 
number  of  options  at  each  stage,  or  provide  a  single  piece  of 
information.  Sensible  defaults  are  provided  as  appropriate. 
The Main Menu is illustrated below:

KEYFINDER - Version 3.03 - Main Menu

1. Current design              7. Design generation, randomization
2. Declare number of units        and sorting
3. List generators             8. Form direct product design
4. Construct design key        9. Execute system command
5. Display design key          P  Exit to Prolog
6. Save/retrieve design key    Q  Quit KEYFINDER


Executable  implementations  of  KEYFINDER  have  been 
developed  for  IBM  PCs  and  PS/2s,  using  SD  Prolog  (Quintec 
Systems  Ltd.,  Oxford),  and  for  SUN  3  and  SUN  4 
workstations,  using  Quintus  Prolog  (Quintus  Computer 
Systems  Inc.,  Mountain  View,  CA).  Version  1  of 
KEYFINDER  was  first  released  in  1989  and  it,  and  its  new 
User  Manual  (Zemroch,  1991),  are  available  from  the  author, 
free of charge. The demonstration at Interface '91 will
include  many  of  the  new  facilities  to  appear  in  the  next 
release  and  these  enhancements  are  discussed  where 
appropriate  in  the  present  paper. 


3.  Design  keys 

The  most  important  design  construction  method  in 
KEYFINDER  is  the  “design  key”,  invented  by  Patterson 
(1965)  and  discussed  further  by  Patterson  and  Bailey  (1978), 
Zemroch (1988) and Zemroch et al. (1989). A design key is a
set of equations, e.g.

P = U, Q = V, ... ;
A = UV, B = W, C = UW, D = VW   (1)

relating the block factors P, Q, ... and treatment factors
A, B, ... to a set of q p-level “plot factors” U, V, ... indexing
the p^q experimental units (p prime). The levels of the block
and treatment factors for each unit i are generated from the
known levels of the plot factors by the equivalent equations

p_i = u_i, q_i = v_i, ... ; a_i = u_i + v_i, b_i = w_i, ... (mod p)   (2)

The  aliasing  and  confounding  properties  of  a  design  are 
readily  deduced  from  its  key:  if  p  =  2  in  key  (1),  for  example, 
then 


CD = UW.VW = UVW^2 = UV = A

BC = W.UW = U = P.   (3)

The  CD  interaction  is  thus  “aliased”  with  the  treatment  factor 
A.  This  means  that  it  will  be  impossible  to  disentangle  the 
effect  of  treatment  A  from  the  CD  interaction  in  the 
subsequent  analysis  of  the  experimental  data.  Therefore  the 
generated  design  should  only  be  used  when  the  scientist  is 
confident  a  priori  that  the  CD  interaction  will  not  manifest 
itself  in  his  experiment.  The  BC  interaction  is  similarly 
“confounded”  with  the  block  factor  P. 
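
The deductions in (3) amount to adding exponent vectors modulo p. The following Python sketch illustrates this for key (1) with p = 2. It is not KEYFINDER's Prolog code, and the representation of each generator as a tuple of exponents of (U, V, W) is an assumption made for the example.

```python
def interaction(word, key, p):
    """Multiply the generators of the factors named in `word`,
    reducing the exponent of each plot factor modulo p."""
    q = len(next(iter(key.values())))
    total = [0] * q
    for name in word:
        total = [(t + e) % p for t, e in zip(total, key[name])]
    return tuple(total)

def coincides_with(word, key, p):
    """Names of the single factors whose generator equals the interaction."""
    target = interaction(word, key, p)
    return sorted(name for name, g in key.items() if g == target)

# Key (1); exponents are listed in the order (U, V, W).
key1 = {"P": (1, 0, 0), "Q": (0, 1, 0),
        "A": (1, 1, 0), "B": (0, 0, 1), "C": (1, 0, 1), "D": (0, 1, 1)}

print(coincides_with("CD", key1, p=2))  # treatment factor aliased with CD
print(coincides_with("BC", key1, p=2))  # block factor confounded with BC
```

The sketch is restricted to the 2-level case. For p > 2, powers of a generator belong to the same degrees of freedom, and the comparison would have to allow for that.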

Each generator U, V, UV, ... in the design key (1) represents
p-1 of the p^q - 1 available d.f. (degrees of freedom).
A 4-level factor needs 3 d.f. and thus requires 3 generators,
e.g.


A  =  (U  V  UV), 

these  forming  a  subgroup  under  multiplication  (minus  the 
identity  I).  The  4  combinations  of  levels  of  the  plot  factors  U 
and  V  give  the  4  levels  of  A. 

The key for a 1/3-replicate 3^3 design in 9 units might be


A = U, B = V, C = UV^2.   (4)


Here the plot factors U and V each have 3 levels and so the
terms U, V, UV (unused) and UV^2 each represent 2 d.f. The
design points are computed as

a_i = u_i, b_i = v_i, c_i = u_i + 2v_i (mod 3).   (5)

* Not available in Version 1.


350  PJ.  Zemroch 
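
As an illustration of how a key generates design points, the sketch below evaluates key (4) over all combinations of plot-factor levels. The representation of a generator as a tuple of exponents, and the function name, are assumptions made for this example only.

```python
from itertools import product


def design_from_key(p, q, key):
    """Generate design points from a design key.

    key maps each factor name to a tuple of q exponents, one per plot
    factor; for example (1, 2) stands for the generator U V^2.
    """
    rows = []
    for plot_levels in product(range(p), repeat=q):
        row = {name: sum(e * u for e, u in zip(exps, plot_levels)) % p
               for name, exps in key.items()}
        rows.append(row)
    return rows


# Key (4): A = U, B = V, C = UV^2 with p = 3 and q = 2 (9 units).
key4 = {"A": (1, 0), "B": (0, 1), "C": (1, 2)}
design = design_from_key(3, 2, key4)
for row in design:
    print(row["A"], row["B"], row["C"])
```

Every pair of factors takes each of its 9 level combinations exactly once in this design, which is what makes the main effects mutually orthogonal.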


4.  Search  algorithm 

In  KEYFINDER,  design  keys  are  tailor-made  to  the  user’s 
specification.  The  user  first  inputs  the  dimensions  of  his 
design,  i.e.  the  number  of  points,  the  block  structure  (see 
Section 5) and the numbers of levels of the various block and
treatment  factors.  If  the  design  is  to  be  a  fractional-replicate, 
the  “resolution”  must  then  be  specified.  This  determines  the 
degree  of  aliasing  which  is  to  be  permitted.  In  a  resolution  r 
design (r ≥ 3), treatment main effects are mutually orthogonal
but  may  be  aliased  with  interactions  of  order  r-1  and  above 
(e.g.  A  =  BCD  if  r  =  4).  Higher  resolution  designs  are  thus 
more  robust  against  unexpected  interactions  than  lower 
resolution  ones,  but  generally  need  more  design  points  to  test 
the  same  number  of  factors.  If  the  design  is  to  be  blocked,  the 
user must also specify the “confounding limit” c (c ≥ 1). This
determines  the  permitted  degree  of  confounding:  treatment 
main  effects  and  interactions  of  order  c  and  below  must 
remain  unconfounded  with  blocks. 

The  KEYFINDER  program  searches  for  a  key  matching  the 
user’s  specification  using  a  depth-first  search  procedure.  First 
the  p-level  plot  factors  U,  V,  W, ...  are  given  values  so  that 
each  combination  of  levels  of  U,  V,  W, ...  uniquely  identifies 
one of the p^q experimental units (p prime). Then a list of
candidate generators, U, V, UV, W, ..., is set up, and these
are  allocated,  without  replacement,  to  each  block  factor  P, 
Q, ...  and  treatment  factor  A,  B, ...  in  turn;  the  actual  order  of 
allocation  depends  on  the  block  structure.  Prolog  rules  ensure 
that  the  design  specification  is  adhered  to  at  each  stage,  the 
status of the key being monitored by means of a number of
internal  lists.  Backtracking  occurs  if  the  list  of  candidates  is 
exhausted  before  the  key  is  complete;  sub-optimal  choices  of 
generators  may  thus  be  discarded  and  alternatives  substituted. 
Combinatorial  explosion  is  controlled  using  symmetry 
concepts.  A  full  exposition  of  the  search  algorithm  may  be 
found  in  Zemroch  et  al.  (1989). 
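
The flavour of the depth-first search can be conveyed by a much reduced sketch. It is not the KEYFINDER Prolog implementation: it assumes 2-level factors, no blocking, and a resolution constraint only, and all names are hypothetical. Candidates are bitmasks over the plot factors taken in the order U, V, UV, W, ..., and allocating them in increasing order is a crude stand-in for the symmetry arguments mentioned above.

```python
from itertools import combinations


def respects_resolution(gens, r):
    """True if no product of fewer than r generators is the identity.

    For p = 2, generators are nonzero bitmasks and products are XORs.
    """
    for k in range(1, r):
        for combo in combinations(gens, k):
            acc = 0
            for g in combo:
                acc ^= g
            if acc == 0:
                return False
    return True


def find_key(q, n_treatments, r, chosen=(), start=1):
    """Depth-first search for treatment generators of resolution >= r."""
    if len(chosen) == n_treatments:
        return list(chosen)
    for g in range(start, 2 ** q):  # candidates U, V, UV, W, ...
        trial = chosen + (g,)
        if respects_resolution(trial, r):
            found = find_key(q, n_treatments, r, trial, g + 1)
            if found is not None:
                return found
        # Otherwise discard g and try the next candidate (backtracking).
    return None


# Four 2-level treatment factors in 8 units at resolution 4.
print(find_key(3, 4, 4))
```

For this request the search rejects UV, UW and VW in turn and settles on the masks for U, V, W and UVW; when the candidate list is exhausted at every level the function returns None.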

5.  Types  of  design 

Version  1  of  KEYFINDER  has  general  procedures  for 
constructing  balanced  multiple-,  single-  and 
fractional-replicate factorial designs using design keys. The
experimental  units  in  these  designs  may  be  arranged,  as 
necessary, into four basic types of block structure, namely,

P  (“simple”) 

P + Q + R + ... (“row and column”)

P*Q*R*...  (=P  +  Q  +  P.Q  +  R  +  ...)  (“crossed”) 

P  /  Q  /  R  /  ...  (=  P  +  P.Q  +  P.Q.R  +  ...)  (“nested”) 

The terms P, Q, P.Q, R, ... in the above “block formulae”
correspond to random terms δ_i, η_j, γ_ij, ... (say) in the
mixed analysis-of-variance model (see, for example, Scheffé,
1959), with the subscripts i, j, k, ... indexing the levels of P,
Q, R, ..., respectively. Split-plot designs may be constructed
with the block factors in a nested structure and the allocation
of treatments to “error strata” (Nelder, 1965) completely
under the user’s control. Keys generate the general class of
design in which the number of experimental units and the
numbers of levels of the various block and treatment factors
are all differing powers of the same prime p.

The next release, previewed at the present Interface '91
conference,  will  offer  a  much  wider  range  of  designs.  One 
of  the  most  significant  enhancements  will  be  facilities  for 
generating  “compromise  plans”  of  intermediate  resolution. 
These  designs  are  particularly  useful  in  situations  where  cost 
constraints  force  the  experimenter  to  use  a  resolution-3 
design  in  which  main  effects  are  aliased  with  two-factor 
interactions.  Whilst  the  user  may  feel  confident  a  priori  of  the 
nonexistence  of  most  of  the  possible  two-factor  interactions, 
there may be some pairs of treatments which he has nagging
doubts about. KEYFINDER’s new compromise-design
procedures allow the user to request (say) a resolution-3
design in which a named subset of important two-factor
interactions must remain unaliased with main effects and/or
unconfounded with blocks. Design key (1) in Section 3, for
example,  protects  the  AB  interaction  by  not  allocating  its 
associated  generator  UVW  to  any  block  or  treatment  factor. 
An  outline  of  how  the  compromise  algorithm  works  is  given 
in  Section  8  of  Zemroch  et  al.  (1989).  A  compromise 
resolution-3 design can, in many instances, allow a set of
factors  to  be  examined  in  substantially  fewer  design  points 
than  a  blanket  resolution-4  design. 

“Asymmetrical”  designs  in  which  the  numbers  of  levels  of 
the  block  and  treatment  factors  are  free  to  vary  can  be 
constructed  in  KEYFINDER  by  the  “direct-product”  method. 
A resolution-3 36-unit 2 x 3‘ x 6^ design, for example, can
easily be formed by combining resolution-3 2^ and 3^
sub-designs, generated from design keys, with 4 and 9 units
respectively. The 36 rows of the main design matrix are
obtained by juxtaposing the rows of the 4- and 9-unit
sub-designs in all possible ways. Each 6-level factor is
formed from the 6 combinations of levels of a 2-level and a
3-level sub-factor.
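
The direct-product construction can be sketched as follows. The two sub-designs used here (a half-replicate of 2^3 and a third-replicate of 3^3) are illustrative choices, not the ones in the example above, and the coding of the 6-level factor as 3 * (2-level) + (3-level) is an assumption; any one-to-one coding of the 6 combinations would serve.

```python
from itertools import product


def direct_product(design_a, design_b):
    """Juxtapose every row of design_a with every row of design_b."""
    return [row_a + row_b for row_a, row_b in product(design_a, design_b)]


def combine_levels(level2, level3):
    """Form one 6-level factor from a 2-level and a 3-level sub-factor."""
    return 3 * level2 + level3


# 4-unit sub-design: third factor = sum of the first two (mod 2).
two_level = [(a, b, (a + b) % 2) for a, b in product(range(2), repeat=2)]
# 9-unit sub-design: third factor = a + 2b (mod 3), as in key (4).
three_level = [(a, b, (a + 2 * b) % 3) for a, b in product(range(3), repeat=2)]

rows = direct_product(two_level, three_level)
# Pair the first 2-level and the first 3-level sub-factor.
six_level = [combine_levels(r[0], r[3]) for r in rows]
print(len(rows), sorted(set(six_level)))
```

Each level of the combined factor occurs equally often, because each sub-factor is balanced within its own sub-design.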

The development of KEYFINDER is ongoing, the aim being
to  produce  a  comprehensive  toolkit  capable  of  generating 
almost  all  existing  blocked  and/or  fractional-replicate  designs 
of  sizes  likely  to  be  used  in  real-world  experimentation 
(under certain constraints of balance). In order to make the
program  as  complete  as  possible,  the  main  effort  is  presently 
being  devoted  to  developing  methods  for  generating  balanced 
designs  that  cannot  be  obtained  from  design  keys  and/or  the 
direct-product method. Typically these designs will have
numbers of units that are not prime powers and they will
include both symmetrical and asymmetrical arrangements.

6.  Randomization 

Randomization  of  an  experimental  design  reduces  the  risk  of 
certain  treatment  levels  being  unfairly  favoured  or 
disfavoured  in  the  experiment  by  extraneous  sources  of 
variation,  for  example  equipment  wear,  fertility  trends,  or 
changes  in  the  weather.  Most  of  the  design  keys  generated  by 
KEYFINDER  yield  designs  with  the  levels  of  the  treatment 
factors  in  some  sort  of  systematic  order.  This  can  increase  the 
chances  of  systematic  external  factors  introducing  bias  and 
heightens  the  importance  of  correctly  randomizing  the  design 
in  accordance  with  the  block  structure. 

Randomization  in  KEYFINDER  is  a  three-stage  process,  all 
stages  being  optional  and  under  the  user’s  complete  control. 
First,  the  levels  of  each  block  and  treatment  factor  are 
randomized  by  performing  a  random  permutation  on  the 
integers  0, 1,..., s-1  labelling  its  s  levels;  different 
permutations  are  used  for  each  factor.  If  a  block  factor,  Q  say, 
is  nested  within  another  block  factor  P  (e.g.  fields  within 
farms),  then  the  levels  of  Q  are  randomized  separately  for 
each  level  of  P.  If  the  design  is  blocked,  the  next  stage  in  the 
process  is  to  sort  the  experimental  units  according  to  the 
(randomized)  levels  of  the  block  factor(s).  The  final  step  is  to 
randomize  the  order  of  the  experimental  units.  If  the  design  is 
blocked,  then  the  units  are  randomized  separately  within  each 
block  (or  block  factor  combination). 
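
The three stages can be sketched for the simplest blocked case. This is an illustrative reading of the description above, assuming a single block factor (no nesting) and units held as dictionaries; the function and argument names are hypothetical.

```python
import random


def randomize(units, block, treatments, n_levels, rng):
    """Three-stage randomization for a design with at most one block factor.

    units: list of dicts mapping factor name -> level (0 .. s-1)
    block: name of the block factor, or None for an unblocked design
    treatments: names of the treatment factors
    n_levels: dict mapping factor name -> number of levels s
    """
    # Stage 1: relabel the levels of each factor by its own permutation.
    factors = list(treatments) + ([block] if block else [])
    perms = {}
    for f in factors:
        perm = list(range(n_levels[f]))
        rng.shuffle(perm)
        perms[f] = perm
    relabelled = [{f: perms[f][u[f]] for f in factors} for u in units]
    if block is None:
        rng.shuffle(relabelled)
        return relabelled
    # Stage 2: sort the units by the (randomized) block level.
    relabelled.sort(key=lambda u: u[block])
    # Stage 3: randomize the order of the units separately within each block.
    out = []
    for b in range(n_levels[block]):
        members = [u for u in relabelled if u[block] == b]
        rng.shuffle(members)
        out.extend(members)
    return out


# A 2^3 factorial in two blocks of four, with P confounded with ABC.
units = [{"A": a, "B": b, "C": c, "P": (a + b + c) % 2}
         for a in range(2) for b in range(2) for c in range(2)]
levels = {"A": 2, "B": 2, "C": 2, "P": 2}
plan = randomize(units, "P", ["A", "B", "C"], levels, random.Random(1991))
```

Passing an explicit random.Random instance keeps the plan reproducible, which is convenient when a design has to be regenerated later.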

7.  Performance 

On  a  SUN  3  or  SUN  4  workstation,  KEYFINDER  can 
generate  keys  for  designs  with,  say,  250  or  fewer  points, 
without  time  or  storage  becoming  a  problem.  Thus  the 
program  caters  comfortably  for  most  of  the  design  sizes  likely 
to  be  used  in  real-world  experimentation.  However,  storage 
problems  are,  at  present,  a  constraint  on  the  scope  of  the  PC 
version  and  this  cannot  generate  keys  for  designs  with  more 
than  about  100  points.  Expected  software  and  hardware 
developments  should  ameliorate  these  problems  in  the  near 
future.


Abstract

This paper describes a nonparametric application of
CART (Breiman et al., 1984) to semi-Markov models, to
provide a nonparametric regression analysis of transition
data. Modeling data without any assumptions about
the nature of the underlying distributions is needed for
initially investigating predictor effects in an exploratory
analysis. The semi-Markov assumption specifies a structure
for the transition process, which is characterized by
the one-step transition distributions. The nonparametric
regression is done on these distributions. For each
one-step transition distribution, the recursive partitioning
of the variable space allows greater interpretability
of the data by splitting the data into homogeneous
subpopulations, and by providing insight into the relative
importance of the different predictors and the way in
which they interact. This method is then applied to
modeling payment source changes of nursing home
residents.

1  Introduction 

Transition processes occur naturally in many settings;
classical examples are progression between disease
states, changes in employment, etc. Probabilistic modeling
of a transition process requires assumptions about
the dependence of the process on its past history. Many
times a first-order Markov assumption is made, identifying
this dependence only on the currently occupied state.
A Markov assumption for continuous time processes can
be extended to a semi-Markov assumption which allows
for non-exponential waiting times in states. Lagakos et
al. (1978) proposed nonparametric estimates for a
homogeneous semi-Markov process. However, in applying
this method to some of our application data, we found
that the estimated process predicts very poorly (Intrator,
1991a). One reason for this is that the process varies
for different subpopulations. This observation is the
motivation for this work: we would like to be able to find
the processes of different subpopulations within a
nonparametric framework. The different subpopulations can
be identified by answering some prognostic type questions
such as: Which are the important variables for
prediction? Can these variables be ranked in order of
importance? Which questions most likely lead to others
so as to determine a sample more homogeneous in
terms of its waiting time distribution? These questions
naturally lead to decision trees.

* Research partially supported by a grant from the Agency for
Health Care Policy and Research.


Classification and regression trees (Breiman et al.,
1984) is a method that recursively partitions the space
of explanatory variables, building a binary decision tree.
Since a fully grown tree may be biased towards the training
data, they suggest pruning the fully grown tree by
penalizing the relative improvement of a split compared
to the addition of an extra node. In this way a sequence
of nested trees can be defined, and one can choose the
“best” nested tree in an exploratory fashion, or by a
good estimate of a prediction error. Among the contributions
of CART are the ease of interpreting its results, by
providing insight and understanding into the predictive
structure of the data. It is a variable selection method
which helps in reducing sensitivity to many variables in
a model. It is relatively unbiased to training data, due
to the pruning mechanism. It is totally nonparametric
and does not require underlying model assumptions.
Most importantly, it identifies effects within subgroups
in contrast to standard regression methods which identify
effects across the entire sample.

2  Methodology

Gordon and Olshen (1985) presented perhaps the first
extension of regression trees (CART) to survival data.
Regression trees for survival data are nonparametric
methods for estimating the distribution of a censored
failure time r.v. T, given regressors x, Pr(T > t | x). Under
a semi-Markov assumption it is possible to reduce
transition data to a set of conditional one-step survival
distributions and apply an extension of CART to survival
data to each one-step waiting time distribution. We
will first discuss the reduction, and then give the highlights
of an extension of CART to survival data: survival
trees (Intrator, 1991b).

2.1  Reduction  to  one-step  transitions 

A finite state space continuous time semi-Markov process
can be defined as: (a) a continuous time process
with a Markov embedded chain of state occupancies; (b)
distributions of waiting times that depend only on the current
state and the destination state, and which are independent
between epochs.

The general likelihood under this model is

$$ L = \prod_{i=1}^{N} p(e^i_0) \prod_{m=1}^{M_i} f(e^i_m, t^i_m, e^i_{m-1}) $$

for N individual histories, with M_i transitions for each
individual history i, where (e_0, ..., e_{M_i}) denote the
states, t_m denotes the waiting time in state e_{m-1}, p(e_0)
are the initial state probabilities, and f(j, t, i) the densities
of transition from state i to state j at time t. If
censoring is considered an absorbing state we can rewrite
the likelihood by:

-  n  n  {fi'''=ii 

At, 

m  -  I  ^ 

where  A  is  the  set  of  absorbing  states,  and  T  is  the  set 
of  transient  states.  ^(  1 . 5:  e.  </)  1  if  n  h  and  <•  d. 

and 0 otherwise. 0(c' c) is the transition probability of
the embedded Markov chain.

Under this framework every particular transition is a
separate failure event, with processes at the same current
state with other destinations considered as “censored”.
Applying survival trees to f(j, t, i) should indicate the
structure of the variables affecting this conditional one-step
transition distribution.
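
The reduction can be sketched as a simple data transformation. The record layout (state, waiting time, next state) and the function name are assumptions for this illustration, not the format used in the study.

```python
def one_step_data(histories, origin, destination):
    """Extract (waiting_time, event) pairs for one origin -> destination.

    histories: list of individual histories, each a list of
               (state, waiting_time, next_state) records.
    A sojourn in `origin` is an event (1) if it ends in `destination`
    and is treated as censored (0) if it ends anywhere else.
    """
    data = []
    for history in histories:
        for state, wait, next_state in history:
            if state == origin:
                event = 1 if next_state == destination else 0
                data.append((wait, event))
    return data


# Hypothetical toy histories; waiting times in days.
histories = [
    [("Medicare", 30, "Medicaid")],
    [("Private", 10, "Medicare"), ("Medicare", 45, "Death")],
    [("Medicare", 12, "Home")],
]
print(one_step_data(histories, "Medicare", "Medicaid"))
```

The resulting pairs are ordinary right-censored survival data, so any survival-tree method can be applied to them one transition type at a time.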


2.2  Survival  trees 

Any extension of CART includes the following ingredients
that should all be nonparametric: (1) Prediction
rule; (2) Splitting rule for growing the tree; (3) Pruning
mechanism; (4) Method for tree selection.

Intrator (1991b) reviews the different extensions of
CART to survival data in light of these points and proposes
an extension in which all the above ingredients are
addressed to achieve CART's advantages. The following
is a summary of that extension.

(1) Prediction rules are the nonparametrically estimated
conditional Kaplan-Meier survival distributions
(which are equivalent to the estimates of Lagakos et al.,
1978, and Dinse and Larson, 1986).

(2) The splitting rule defined is based on between-node
separation measures such as extensions to censored data
of rank type tests for the conditional survival
distributions.

(3) The pruning method is based on the significance
level values v(t, d) = Pr(S(t_L) ≠ S(t_R)), the p-value
of the test d for the split at that node into the left and
right branches t_L and t_R. The chosen split is v(t) =
min_d v(t, d). The risk for every terminating node is
defined to be zero. For any decision node t we define the
risk of its branch T_t by:

$$ R(T_t) = p(t) \cdot \max_{t' \in T_t} (1 - v(t')) $$

where p(t) is the proportion of observations at node t.
Notice that the maximization is done over all the nodes
of the branch T_t and not only over its terminating nodes,
so R(T_t) is monotonically non-increasing when going
from top down in the tree. This definition allows us to
use the CART method for cost complexity pruning in
the usual way. For more information about the pruning
see Intrator (1991b).

(4) We can choose a tree from the nested sequence
of pruned subtrees in an exploratory fashion by choosing
the tree at a prespecified level of α, or by choosing
a tree with a certain number of terminating nodes.
Exploratory trees serve as a basis for comparison with other
trees, for the effects of covariates.

An alternative to exploratory tree selection is based
on computing an estimate R^{ts}(T_n) of the prediction
error of the trees {T_n} in the sequence of pruned
subtrees. We propose to use:

$$ R^{ts}(T_n) = \sum_{t \in \tilde{T}_n} p(t) \Pr(S^{ts}(t) \neq S^{L}(t)), $$

where the sum is over the terminating nodes of T_n,
S^{ts}(t) is the estimated survival curve at node t
based on the testing data, and S^{L}(t) is that based on
the learning sample. We estimate the prediction error
of a node, Pr(S^{ts}(t) ≠ S^{L}(t)), by running a test sample
down the tree and comparing survival distributions
between the training sample and the testing sample, at
every terminating node. We can choose the right-sized
tree as that subtree with the lowest estimate of the risk.
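
The prediction rule in item (1) is the Kaplan-Meier estimate computed within each node. A minimal sketch of that estimate for one node is given below; the input format of (time, event) pairs and the function name are assumptions, and ties between events and censorings are handled in the usual way (events first).

```python
def kaplan_meier(data):
    """Kaplan-Meier survival estimate.

    data: list of (time, event) pairs, event 1 = transition observed,
          event 0 = censored.
    Returns a list of (t, S(t)) at each distinct observed event time.
    """
    at_risk = len(data)
    survival = 1.0
    curve = []
    for t in sorted({time for time, _ in data}):
        events = sum(1 for time, e in data if time == t and e == 1)
        leaving = sum(1 for time, _ in data if time == t)
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical node with four sojourns, one of them censored at time 2.
print(kaplan_meier([(1, 1), (2, 0), (3, 1), (4, 1)]))
```

Comparing two such curves, one from the learning sample and one from the test sample, is the ingredient needed for the node-level prediction error described above.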

3  Application 

In the following application we look at changes in payment
sources for patients newly admitted to nursing
homes. More details about this study can be found in
Mor et al. (1991). The data come from regular assessments
of patients in the National Health Corporation
(NHC), a chain of 48 for-profit nursing homes operating
in 11 states, mainly in Missouri, Kentucky, South Carolina,
and Tennessee. Assessment records are collected
at admission, periodically thereafter, and at discharge
on a standard form. Data on payment source changes are
recorded retroactively with the correct date of change,
thus payment sources are monitored continuously. The
specific cohort used for this analysis consists of all patients
newly admitted to an NHC home in 1982 or in 1984.
This analysis is part of a research project on characterizing
residents who spend down to Medicaid. In applying the
above methods our aim is to find important predictors
for modelling the process of payment source changes until
patients become Medicaid recipients, go back to the
community, or die.

3.1  Results 

We defined the following state space for this application.
Transient states: Private Payment; Medicare; “Other”
Insurance payment; Home; Hospital. Absorbing states:
Medicaid; Death; Censoring (usually due to return into
the community). Our attention in this paper focuses on
the transitions from Medicare to Medicaid and to death.
Discussion of other transitions can be found in Intrator
(1991a). The variables investigated include activity of
daily living (ADL), a hierarchical scale from 1 (best) to
5 (worst), sex, education, marital status, living arrangement
prior to entrance into nursing home, diagnoses of
acute or chronic illness, state (KY, MO, SC, TN), age,
and initial payment source.

Figure 1 presents the full grown tree (without pruning)
for the transition from Medicare to Medicaid.

A first level pruning would eliminate terminating
nodes E and F, and a second level pruning would eliminate
all subnodes of node 2. State participation seems
to be the most important predictor for this transition.


Figure 1: Full Tree for Transition from Medicare to Medicaid.
Squares and capital letters indicate terminating
nodes, number in nodes indicates number of observations
in node.

Specifically, being in Kentucky increases the hazard of
conversion the most. The competition at this node between
the split on Tennessee and Kentucky is very close
(a chi-square of 143 vs. 145), therefore it is not surprising
that the next split is on Tennessee. Education is
important only in Tennessee.

Figure 2 presents a pruned tree for the transition from
Medicare to death. The pruning was done in an exploratory
manner, at a level α = .07.

For this tree, the root split is by most severe functional
impairment, ADL = 5. Patients who are severely impaired
(ADL = 5) are then split according to whether or not their
impairment is acute (hip fracture or stroke). Patients
who are less severely functionally impaired (node 2) are
split by sex, with competition for the split from both the
next ADL level 4, and from acute. The next split for
both males and females of ADL levels 1-4 is on acute,
and thereafter on ADL levels.

3.2  Cox  type  regression 

In Intrator (1991a) a Cox type regression method for
transition data was developed, and applied to this data
as well. Here we would like to compare the results of
that method (Table 1), after variable selection at a 95%
level, with the tree results.

For the transition to death both methods reveal that
ADL is a most prominent predictor. Sex is also predictive
in both analyses. Kentucky, or any other state
participation, is not present in the trees at all, although
it is present in the regression analysis.

Figure 2: Pruned Tree for Transition from Medicare to
Death. Squares and capital letters indicate terminating
nodes, number in nodes indicates number of observations
in node.


For the transitions to Medicaid, the trees emphasize
the effect of state participation, and education in Tennessee.
In the regression model we have Tennessee and
more education variables, but also ADL levels which do
not appear in the trees. This may reflect either correlated
covariates, or effects across samples, which are
eliminated with interactions.

Further  analysis  of  this  data,  concentrating  on  those 
residents  initially  admitted  as  private  paying  individuals, 
under  a  Cox  model,  with  comparison  to  results  of  the 
tree  analysis  is  forthcoming  in  a  joint  work  with  Anthony 
Lancaster,  and  Vincent  Mor. 


From Medicare   Variable           β̂          σ̂(β̂)

To Medicaid     ADL = 2           -0.472      0.228
(N = 637)       1-8 Grades        -0.449      0.186
                9-12 Grades       -0.920      0.201
                >12 Grades        -1.465      0.249
                Tennessee         -0.900      0.201
                Home Days         -.994e-2    .140e-2

To Death        ADL = 3            0.903      0.230
(N = 1420)      ADL = 4            1.231      0.230
                ADL = 5            1.825      0.225
                South Carolina    -0.330      0.132
                Kentucky          -0.366      0.134
                Male               0.462      0.060

Table 1: Coefficients for one-step transition from Medicare


4  Conclusions 

The importance of the tree analysis in this context was
to highlight the structure of the interactions between the
variables affecting the one-step transition distributions.
The Cox-type regression model could only identify effects
across the sample, thus leading to identification of
meaningful predictors that were perhaps only correlated
with other important predictors. The interpretability of
the tree results is self-evident. It is easy to point out the
prognostic variables affecting the process. The regression
model, on the other hand, can provide easier predictions
of summary statistics, such as total probability of transition
at different times, and expected number of transitions.


Abstract

Methodology  is  described  for  constructing  kernels  for  the 
purpose  of  identifying  and  separating  the  components  of  a 
mixture  of  densities.  One  such  kernel  has  the  property  of 
reducing  the  variance  of  the  individual  subcomponents  of  a 
mixture  thereby  making  them  more  visible.  A  second 
method  based  on  a  weighted  version  of  the  Mean 
Integrated  Square  Error  metric  takes  advantage  of  the 
properties  of  mixtures  comprised  of  densities  with  differing 
location  parameters.  The  resulting  kernel  focuses 
alternatively  on  either  the  right  or  the  left  side  of  the 
variate support region. Combined with the
variance-reducing kernel, this procedure enhances the estimation of
either the leftmost or rightmost mixture subcomponent.

1.  Introduction 

As  detailed  by  Titterington,  Smith  and  Makov  (1985), 
mixture  distributions  have  had  a  long  and  rich  history.  The 
methodology  outlined  below  is  a  kernel-based  curve 
estimation  approach  to  mixture  decomposition  which 
incorporates  two  novel  elements.  The  first  is  that  a  kernel 
is  constructed  which  has  the  property  of  reducing  the 
variance  of  the  individual  subcomponents  of  the  mixture. 
The  second  relies  on  modifying  the  kernel  to  enhance  the 
estimation  of  a  particular  subregion  of  the  density. 

This  second  procedure  takes  advantage  of  the  fundamental 
asymmetry  inherent  in  mixtures  which  have  components 
with  unequal  location  parameters.  Since  by  definition 
there  exists  some  region  of  the  density  where  one 
subpopulation  is  more  prevalent  than  another,  the 
estimation  of  individual  subcomponents  can  be  improved 
by  using  different  kernels  for  different  subregions.  In  this 
way  the  method  outlined  below  is  similar  to  the  variable 
kernel  method  described  by  Breiman,  Meisel  and  Purcell 
(1977). The combination of a variable kernel with a
variance-reducing  kernel  has  the  potential  to  greatly 
enhance  mixture  decomposition  methodology. 

Given an independent sample X_1, ..., X_n from the density f,
the kernel estimator of f is defined by Silverman (1986) as

$$ \hat{f}(x) = \frac{1}{nh} \sum_{j=1}^{n} K\left(\frac{x - X_j}{h}\right) $$

where  K  is  the  kernel  function  and  h  is  called  the 
smoothing  parameter  or  bandwidth.  K  is  often  chosen  a 
priori  from  a  class  of  nonnegative,  symmetric  functions, 
for  example  the  family  of  Gaussian  densities  with  scale 
parameter h (Rudemo, 1982).
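
For concreteness, the kernel estimator with a Gaussian kernel can be sketched in a few lines. This is the standard textbook form, f_hat(x) = (1/(nh)) sum_j K((x - X_j)/h); the function names and the sample values are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_estimate(x, sample, h, kernel=gaussian_kernel):
    """f_hat(x) = (1 / (n h)) * sum_j K((x - X_j) / h)."""
    n = len(sample)
    return sum(kernel((x - xj) / h) for xj in sample) / (n * h)


sample = [0.0, 1.0, 2.5]  # hypothetical observations
print(kernel_estimate(1.0, sample, h=0.5))
```

Because the kernel is a density and each term is rescaled by h, the estimate itself integrates to one for any bandwidth.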

Another popular class of density estimators is based on
orthogonal series expansions. In particular, the Fourier
series estimator is defined

$$ \hat{f}(x) = \sum_{k=-\infty}^{\infty} b_k \hat{B}_k \exp\{2\pi i k x\} $$

where

$$ \hat{B}_k = n^{-1} \sum_{j=1}^{n} \exp\{-2\pi i k X_j\}, $$

i = √−1 and {b_k}, the multiplier sequence, is a sequence of
real numbers chosen to optimize the estimator in some
respect. For example, {b_k} may be used to truncate the above
expansion at some optimal point. The main focus of this
paper is the choice of particular multipliers for the purpose
of identifying and separating the components of a mixture
distribution.


One of the advantages of Fourier series estimators is their
near identity with kernel methods; that is, with few
exceptions, a particular estimate may be expressed as
either a Fourier series estimator or a kernel estimator,
whichever interpretation is more convenient. This can be
seen through a simple rearrangement of the above
expression for the Fourier series estimator:

$$ \hat{f}(x) = \sum_{k} b_k \left[ n^{-1} \sum_{j=1}^{n} \exp\{-2\pi i k X_j\} \right] \exp\{2\pi i k x\}
            = n^{-1} \sum_{j=1}^{n} \sum_{k} b_k \exp\{2\pi i k (x - X_j)\}. $$

Under suitable conditions on the multipliers, {b_k} is the
sequence of Fourier coefficients of the kernel defined by the
Fourier series density estimator. Alternatively, many kernel
estimates can be expressed as Fourier series by using the
expressions for the characteristic functions of truncated
densities given by Kronmal and Tarter (1968).

Figure 1. Dirichlet kernel with the truncation point, m,
equal to 6.

As an illustration, consider the estimator determined by the
multiplier sequence b_k = 1, |k| ≤ m; b_k = 0, |k| > m, where
m, the truncation point, is some positive integer. This
leads to what Wahba (1981) calls the raw Fourier series
estimator:

$$ \hat{f}(x) = \sum_{k=-m}^{m} \hat{B}_k \exp\{2\pi i k x\}. $$

The kernel defined by this multiplier sequence is the
Dirichlet kernel, shown in Figure 1.
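
The equivalence between the series form and the kernel form can be checked numerically. The sketch below assumes data on the unit interval and a multiplier sequence equal to 1 for |k| ≤ m and 0 otherwise; the function names and sample values are hypothetical.

```python
import cmath
import math


def sample_coefficient(k, sample):
    """B_hat_k = (1/n) * sum_j exp(-2 pi i k X_j)."""
    return sum(cmath.exp(-2j * math.pi * k * xj) for xj in sample) / len(sample)


def raw_fourier_estimate(x, sample, m):
    """Series form: sum over |k| <= m of B_hat_k * exp(2 pi i k x)."""
    total = sum(sample_coefficient(k, sample) * cmath.exp(2j * math.pi * k * x)
                for k in range(-m, m + 1))
    return total.real  # imaginary parts cancel between k and -k


def dirichlet_kernel(u, m):
    """D_m(u) = sum over |k| <= m of exp(2 pi i k u), written with cosines."""
    return 1.0 + 2.0 * sum(math.cos(2.0 * math.pi * k * u) for k in range(1, m + 1))


def kernel_form_estimate(x, sample, m):
    """Kernel form: (1/n) * sum_j D_m(x - X_j)."""
    return sum(dirichlet_kernel(x - xj, m) for xj in sample) / len(sample)


sample = [0.10, 0.35, 0.40, 0.80]  # hypothetical data on [0, 1)
for x in (0.0, 0.25, 0.5):
    print(raw_fourier_estimate(x, sample, 6), kernel_form_estimate(x, sample, 6))
```

The two columns agree to rounding error, and the negative side lobes of the Dirichlet kernel explain why such an estimate can dip below zero.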

The  close  correspondence  between  kernel  and  Fourier 
estimators  means  that  theoretical  results  derived  for  one 
method  are  often  applicable  to  the  other  method  as  well. 
In addition, the kernel form of the estimator is often quite
helpful  in  conceptualizing  the  series  estimation  process. 
Thus,  in  the  following  exposition  we  will  interchange 
kernel  and  series  terminology  and  concepts  where  one  is 
more  appropriate  than  the  other.  In  particular,  a  mixture 
decomposition  process  will  be  developed  from  the  series 
point  of  view  but  will  also  be  presented  in  terms  of  the 
kernel  defined  by  the  procedure. 

2.  Variance-reducing  Kernels 

As  noted  above,  the  multiplier  sequence,  and  thus  the 
kernel,  is  usually  chosen  to  optimize  the  estimator  with 
respect  to  some  global  measure  of  accuracy,  that  is,  with 
respect  to  some  metric.  For  example,  an  extensively 
studied metric is the Mean Integrated Square Error, MISE:

$$ \mathrm{MISE}(\hat{f}) = E \int \{\hat{f}(x) - f(x)\}^2 \, dx $$

where the integral is taken over the entire range of the
density's support. Methods for choosing an MISE-optimal
multiplier sequence based on selecting a truncation point,
m, for the raw Fourier series estimator have been proposed
by Tarter and Kronmal (1976), Hart (1985) and Diggle and
Hall (1986). More general multiplier sequences have been
suggested by Watson (1969), Fellner (1974), Brunk (1978)
and Wahba (1981).

Figure 2. Estimate of a mixture of two Gaussian densities:
f(x) = .6N(0,1) + .4N(2.5,1).

By  minimizing  the  MISE,  these  methods  all  strive  to 
produce  density  estimates  which  are  optimal  in  an  overall 
sense,  that  is,  accurate  throughout  the  entire  range  of  the 
estimate.  Thus,  these  multiplier  sequences  are  designed  to 
provide  an  overall  view  of  the  density  function.  Other 
choices  of  multipliers,  however,  may  be  more  suitable  for 
more  specialized  purposes.  Two  such  alternatives 
described  below  are  designed  to  help  identify  and  separate 
the  individual  components  of  a  mixture  of  densities. 

Consider the density estimate shown in Figure 2. Here
$\{b_k\}$ was chosen to minimize the MISE according to a
procedure outlined in Tarter and Kronmal (1976). The true
density is a mixture of two Gaussian curves, although the
substantial overlap between the components has made this
structure difficult to see. To enhance the distinction
between the subcomponents, a density estimate could be
constructed which reduced the overlap between them.
Such an estimate is shown in Figure 3; here the two
subpopulations are clearly visible. Although not optimal
in an overall sense, the estimate shown in Figure 3 is
more useful than the estimate in Figure 2 for
searching for hidden subcomponents.

The procedure used to create the estimate shown in Figure
3 is described in Chapter 4 of Titterington, Smith and
Makov (1985) and relies on a particular choice of
multiplier sequence. Specifically, let $\{b_k\}$ be the multiplier
sequence chosen for an initial estimate of the density, like
that shown in Figure 2. Usually $\{b_k\}$ will be chosen to be
optimal in some global sense such as minimal MISE. A
reduced-overlap estimate of the density can then be
obtained by selecting the sequence

$$b_k^* = b_k \exp\{2(\pi k \lambda)^2\}, \quad k = \pm 1, \pm 2, \ldots$$

where $\lambda$ is a user-selected, positive number. Although
used on a mixture of Gaussian components here, this
process has been shown by Tarter (1979) to be applicable
to a broad class of mixtures.

Figure 3. Estimate of the density described in Figure 2
after application of the variance reduction process.

The application of $\{b_k^*\}$ reduces the variance of the
estimated subcomponents while leaving the other moments
of the distribution unaffected. The degree of variance
reduction is determined by the magnitude of the constant
$\lambda$. The effect of $\lambda$ can be seen by graphing the kernel
determined by the multiplier sequence $\{b_k^*\}$. The kernels
depicted in Figure 4 both incorporated the initial sequence
$b_k = 1$, $|k| \le 6$; $b_k = 0$, $|k| > 6$ but used different values of
$\lambda$. Note that for a very small value of $\lambda$ the kernel looks
very much like the Dirichlet kernel shown in Figure 1 and
thus the process has little effect. The larger value of $\lambda$
results in a more extreme kernel; in particular, the amount
of negativity increases as $\lambda$ increases. Since, as noted in
Jones (1991), a nonnegative kernel inflates the variance of
an estimate, it is not surprising that variance is reduced by
increasing $\lambda$ and hence increasing kernel negativity.

Figure 4. Kernels determined by the $\{b_k^*\}$ multiplier sequence. On the left, $\lambda$ = .001; at right $\lambda$ = .001.

Once the overlap between components has been reduced, a
procedure outlined in Tarter (1979) can be used to
eliminate the contribution of one of the now distinct
subcomponents. The variance-reduction process can then
be reversed, resulting in an estimate of the remaining
component. (The process also extends easily to any
number of distributional subcomponents.) Once isolated,
the remaining component can be analyzed by the model
identification techniques described in Tarter and Lock
(1988).

3. Locally-enhanced kernels

In the previous section it was suggested that a globally
optimal estimate of the density should be constructed prior
to decomposing a mixture distribution. However, the
ultimate goal of the above example was to produce and
analyze an estimate of only the left component of the
mixture. With this in mind it is clearly advantageous to
estimate the left side of the distribution as accurately as
possible, even at the expense of losing some resolution in
the right side of the estimate. This can be accomplished by
selecting a multiplier sequence which minimizes the
weighted MISE:

$$J(\hat f, f, w) = E \int \{\hat f(x) - f(x)\}^2 w(x) \, dx.$$

The weight function w(x) is selected by the researcher to
emphasize the estimation of a particular region of the
density. The function

$$w_L(x) = 1 + \alpha \sin(2\pi x), \quad x \in [0,1], \ \alpha \in [0,1],$$

where $\alpha$ is a user-selected constant, can be used to
increase the accuracy of the left side of the density
estimate. A graph of $w_L(x)$ for three different values of $\alpha$
is shown in Figure 5.

Methods for choosing $\{b_k\}$ to minimize the weighted MISE
are explored in Lock (1990) and Tarter, Freeman and
Polissar (1990). In particular, both authors consider the
case where w(x) can be represented by a three-term Fourier
expansion:

$$w(x) = \sum_{k=-1}^{1} w_k \exp\{2\pi i k x\}.$$

This is the case for the weight function $w_L(x)$. Utilizing
this class of weight functions, methods are detailed in Lock
(1990) for choosing an optimal truncation point for the raw
estimator and for choosing more general multipliers as
well. Simulation results show the efficacy of the methods
in increasing the local accuracy of the estimator.
Combined with the techniques described in the previous
section, these methods offer a promising new approach to
mixture decomposition.
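
The variance-reducing multipliers of Section 2 can be illustrated numerically. The following is a minimal sketch, assuming data rescaled to the unit interval and the truncation sequence with $|k| \le 6$; the function names and the two values of the constant (0.001 and 0.1) are illustrative choices, not values taken from the paper. It evaluates the kernel implied by a multiplier sequence and checks that multiplying the Fourier coefficients of a Gaussian with standard deviation sigma by the variance-reducing factor gives the coefficients of a Gaussian with variance sigma^2 - lam^2.

```python
import numpy as np

def variance_reducing_multipliers(k, b, lam):
    """Return b_k^* = b_k * exp{2 (pi k lam)^2} for integer frequencies k."""
    k = np.asarray(k, dtype=float)
    b = np.asarray(b, dtype=float)
    return b * np.exp(2.0 * (np.pi * k * lam) ** 2)

def kernel_from_multipliers(k, b, x):
    """Evaluate K(x) = sum_k b_k exp(2 pi i k x); real when b_k = b_{-k}."""
    k = np.asarray(k, dtype=float)
    phases = np.exp(2j * np.pi * np.outer(x, k))  # shape (len(x), len(k))
    return (phases @ np.asarray(b, dtype=complex)).real

# Initial sequence: b_k = 1 for |k| <= 6 (Dirichlet-type kernel).
k = np.arange(-6, 7)
b = np.ones(k.size)
x = np.linspace(0.0, 1.0, 2000, endpoint=False)

# Hypothetical small and large values of the variance-reduction constant.
kernel_small = kernel_from_multipliers(k, variance_reducing_multipliers(k, b, 0.001), x)
kernel_large = kernel_from_multipliers(k, variance_reducing_multipliers(k, b, 0.1), x)

# Effect on a Gaussian component: coefficient magnitudes exp{-2 (pi k sigma)^2}
# become those of a Gaussian with variance sigma^2 - lam^2.
sigma, lam = 0.10, 0.06
gaussian_coef = np.exp(-2.0 * (np.pi * k * sigma) ** 2)
sharpened_coef = variance_reducing_multipliers(k, gaussian_coef, lam)
reduced_sd = np.sqrt(sigma ** 2 - lam ** 2)
```

Comparing `kernel_small.min()` with `kernel_large.min()` shows the growth in kernel negativity as the constant increases, in line with the discussion of Figure 4.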
Abstract

Neural Network Learning Systems are models
which are loosely inspired by notions of how
self-organization and learning in biological systems
might occur. These models are closely related to
many established pattern recognition, classification,
and regression techniques. Many exciting applications
of these methods are being pursued, including
nervous system modeling, robotics, signal processing,
zipcode and speech recognition, speech production,
computer backgammon, and financial analysis.
This short paper is intended as a pointer to some
of the vast literature covering this field.


Introduction 

Statistics and neural network learning systems have
much in common. Since neural network learning
systems are being developed in a wide variety of
contexts, statisticians are likely to find that the field
offers many relevant and exciting avenues to explore.
Furthermore, statisticians are well equipped
to make significant contributions to this field.

As  many  excellent  sources  on  neural  networks  are 
available,  I  will  not  attempt  to  provide  a  complete 
and  detailed  introduction  to  and  overview  of  this 
field  in  the  short  amount  of  space  available  here. 
Rather, I shall make a few brief comments and
provide some pointers to the literature.


Types  of  Learning 

Neural network learning systems can be grouped
into three categories: supervised learning,
unsupervised learning, and reinforcement learning.
Supervised and unsupervised learning systems include a
number of standard statistical methods, while
reinforcement learning systems are more similar to the
way animal learning systems probably work.

Supervised learning systems are analogous to
statistical classification and regression techniques.
Unsupervised learning systems are analogous to such
established statistical methods as density estimation,
cluster analysis, principal components,
multidimensional scaling, and so on. Neural network
models often differ from their statistical analogs,
however, in that they are usually nonlinear and are
often real-time or adaptive.

Reinforcement  learning  differs  from  classification 
and  regression  in  two  ways.  First,  the  system  is  not 
provided  with  explicit  target  values  during  training, 
but  is  simply  given  reward  or  penalty  signals  based 
upon  performance.  These  reward/penalty  signals 
may  be  delayed.  Second,  the  behavior  of  the  system 
has  a  random  component  which  allows  it  to  explore 
via trial and error.

Of these three types of learning, supervised
learning algorithms have received the greatest amount
of theoretical analysis and have enjoyed the widest
ranging practical application. Unsupervised
learning systems are also widely used. Reinforcement
learning algorithms are the least widely applied and
the most poorly understood. However, they are
perhaps the most interesting.
Some Active Areas of Research

The range of active research topics in neural
networks covers many disciplines. The following list is
adapted from the list of oral sessions of the 1990
Neural Information Processing Systems conference
program (see Lippmann, Moody, and Touretzky
1991):

Learning and Memory: Associative Memory,
Classical Conditioning, Memory Organization
and Indexing, Biophysics of Synaptic Change,
etc.

Navigation and Planning: Animal Behavior,
Robotics.

Temporal and Real Time Processing: Time Series
Prediction, Music, Architectures for
Real Time Adaptive Signal Processing and
Control.

Learning and Generalization: Learning
Algorithms & Architectures, Data Representations,
Theory.

Visual Processing: Motion Processing, Color
Constancy, Perceptual Grouping, Psychophysics,
Organization of Visual Cortex, etc.

Speech Processing: Speech Recognition,
Language Understanding.

Signal Processing: Nonlinear Adaptive SP;
Animal Perception, e.g. Bat Echo Location; Signal
Pattern Classification, e.g. Dolphin Speech.

Control: Animal Motor Control, e.g. VOR; Robot,
Vehicle, and Engine Control; Chemical Process
Control, etc.

Unsupervised Learning: Competitive Learning,
Hebb Rules, Clustering, Exploratory
Projection Pursuit.

Self Organization: Development of Cortical and
Dendritic Organization.
Inspired by the information theoretic idea of minimum
description length, we add a term to the usual
back-propagation cost function that penalizes network
complexity. From a Bayesian perspective, the complexity term
can be usefully interpreted as an assumption about the prior
distribution of the weights. This method, called
weight-elimination, is contrasted to ridge regression and to
cross-validation. We apply weight-elimination to time series
prediction. On the sunspot series, the network outperforms
traditional statistical approaches and shows the same
predictive power as multivariate adaptive regression splines.

We  show  how  the  effective  number  of  parameters  changes 
during  training  by  analyzing  the  eigenvalue  spectra  of  the 
covariance  matrix  of  hidden  unit  activations  and  of  the 
matrix  of  weights  between  inputs  and  hidden  units.  We 
find  that  the  effective  ranks  of  these  matrices  are  equal  to 
each  other  when  a  solution  is  reached,  and  interestingly 
also  equal  to  the  number  of  hidden  units  of  the  minimal 
network  obtained  with  weight-elimination. 


1 INTRODUCTION

Connectionist networks, also called brain-style computation
or artificial neural networks, are ensembles of interconnected,
usually nonlinear, units. The values of the connections
between the units are estimated by a learning algorithm. This
approach differs from traditional statistics both by the ubiquitous
use of nonlinearities and by the sheer number of parameters.

Connectionist networks were first applied to time series
prediction by Lapedes and Farber (1987). Whereas many
researchers in the dynamical systems community only deal with
noise-free, computer-generated time series, we focus on noisy,
real-world data of limited record length. In this case, the
problem of overfitting can become serious.
A priori, it is not clear what network size is required to solve
a given problem. If the network is too small, it will not be
flexible enough to emulate the dynamics of the system that
produced the time series (“underfitting”). If it is too large,
the excess freedom will allow the network to fit not only the
signal but also the noise (“overfitting”). Both too small and
too large networks thus give poor predictions in the presence
of noise.

The key idea of weight-elimination is to add a penalty term
accounting for network complexity to the usual cost
function. The trade-off between performance and complexity
is reflected in the sum of a performance and a complexity
term. There is a u-shaped minimum between the extremes
of having an overly simple network that produces horrendous
errors and a network with small errors on the training data that
has enormous complexity. This sum is minimized through
back-propagation (Rumelhart et al., 1986).

1.1  ARCHITECTURE 

Fig. 1 shows the architecture (the pattern of connectivity or
topology) of a feed-forward network with one hidden layer.
(For the time series we analyzed, one hidden layer sufficed.)
The abbreviation d-n-1 denotes the following network:

•  The d input units are given the past values $x_{t-1}, \ldots, x_{t-d}$
of the time series $\{x_t\}$.

•  The input units are fully connected to n nonlinear hidden
units.

•  All hidden units are connected to a linear output unit.

•  Output and hidden units have adjustable biases b.

•  The weights can be positive, negative or zero.

The nonlinearities are located in the activation function (or
transfer function) of the hidden units. The output (or response)
of a hidden unit is called its activation. It is a composition
of two operators: an affine mapping, followed by a nonlinear
transformation. First, the inputs into a hidden unit h are
linearly combined, and a bias $b_h$ (or offset) is added,

$$\xi_h = \sum_{i=1}^{d} w_{hi} x_i + b_h = \mathbf{w}_h \cdot \mathbf{x} + b_h .$$

$x_i$ stands for $x_{t-i}$, the value of input i, and $w_{hi}$ is the weight
between input unit i and hidden unit h.

Figure 1: Architecture of a simple feed-forward network.

Before turning to the second step, we give a geometric
interpretation of $\xi_h$. A hidden unit only responds to $\mathbf{w}_h \cdot \mathbf{x}$, the
projection of the input vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ onto the
weight vector $\mathbf{w}_h = (w_{h1}, w_{h2}, \ldots, w_{hd})$. Changes in the
input that are orthogonal to the direction of the weight vector
have no effect on the activation of the hidden unit. The
“equi-activation surfaces” (on which a hidden unit’s activation is
constant) are hyperplanes orthogonal to the direction of $\mathbf{w}_h$.
The parameters of a hidden unit h can be characterized by

•  a direction, $\mathbf{w}_h / \|\mathbf{w}_h\|$,

•  a scale parameter, $\|\mathbf{w}_h\|$, and

•  a location parameter, $b_h / \|\mathbf{w}_h\|$.

The symbol $\|\cdot\|$ denotes the (Euclidean) length of the vector.

The second step can be viewed as “piping” $\xi_h$ through a
nonlinear activation function. We here choose sigmoid (or
logistic) units whose activations $S_h$ are given by

$$S_h = S(\xi_h) = \frac{1}{1 + e^{-a \xi_h}} = \frac{1}{2}\left(1 + \tanh\frac{a \xi_h}{2}\right).$$
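
The two steps above, together with the linear output unit listed in the description of the d-n-1 architecture, can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function names, array shapes, and random parameter values are hypothetical and serve only to exercise the formulas. It also checks the geometric remark that input changes orthogonal to a weight vector leave that unit's affine input unchanged.

```python
import numpy as np

def sigmoid(xi, a=1.0):
    """Logistic activation S(xi) = 1 / (1 + exp(-a * xi)), with gain a."""
    return 1.0 / (1.0 + np.exp(-a * xi))

def forward(x, W, b_hidden, v, b_out):
    """Prediction of a d-n-1 network.

    x: (d,) past values; W: (n, d) input-to-hidden weights;
    b_hidden: (n,) hidden biases; v: (n,) hidden-to-output weights;
    b_out: scalar output bias.
    """
    xi = W @ x + b_hidden   # affine step: xi_h = w_h . x + b_h
    s = sigmoid(xi)         # nonlinear step: hidden activations
    return v @ s + b_out    # linear output unit

# Hypothetical 12-8-1 network with random parameters.
rng = np.random.default_rng(0)
d, n = 12, 8
x = rng.normal(size=d)
W = rng.normal(size=(n, d))
b_hidden = rng.normal(size=n)
v = rng.normal(size=n)
b_out = 0.1
prediction = forward(x, W, b_hidden, v, b_out)

# An input change orthogonal to w_h does not alter xi_h.
w_h = W[0]
delta = rng.normal(size=d)
delta -= (delta @ w_h) / (w_h @ w_h) * w_h   # remove the component along w_h
xi_before = w_h @ x + b_hidden[0]
xi_after = w_h @ (x + delta) + b_hidden[0]
```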


The gain a can be absorbed into weights and biases without
loss of generality and is set to unity. The sigmoid performs a
smooth mapping $(-\infty, +\infty) \to (0, 1)$.

The output of the network yields the prediction $\hat{x}_t$ as a weighted
sum of the activations of the hidden units. To summarize,
connectionist networks globally superimpose nonlinear functions
to produce an output that can be viewed as a surface above
the $(x_1, x_2, \ldots, x_d)$-plane of the inputs.

Viewed from the perspective of statistics, the network
estimates the conditional mean,

$$E[p(x_t \mid x_{t-1}, x_{t-2}, \ldots, \text{parameters})],$$

where p is the probability of an output value $x_t$ for a given
input vector. Note that this probability distribution p of the
model, given inputs and parameters, is not to be confused
with the probability distribution of the observed output given
the predicted output, i.e., the error model. Whereas the prior
assumptions about measurement noise and model
misspecification are reflected in a usually simple error model (we here
assume the errors to be Gaussian distributed), the conditional
mean depends on the data and can be fairly complicated.

1.2 EVALUATION

To evaluate and compare the predictive power of different
algorithms, we use the relative mean squared error or average
relative variance¹ of a set S, arv(S), defined as

$$\mathrm{arv}(S) = \frac{\sum_{k \in S} (\text{target}_k - \text{prediction}_k)^2}{\sum_{k \in S} (\text{target}_k - \text{mean})^2} = \frac{1}{\hat\sigma^2} \frac{1}{N} \sum_{k \in S} (x_k - \hat{x}_k)^2 . \qquad (1)$$

The sum extends over the set S of pairs of the actual values
(or targets, $x_k$) and predicted values ($\hat{x}_k$). The averaging
(division by N, the number of data points in a set S) makes the
measure independent of the size of the set. The normalization
(division by $\hat\sigma^2$, the estimated variance of the data) removes
the dependence on the dynamic range of the data.

This quantity corresponds to the fraction of the squared error
of the data that is not “explained” by the model. The symbol
S in Eq. 1 indicates the data set used to compute the errors:

•  Training set. This part of the data is used to estimate
the parameters. The fitting error (or approximation error
or in-sample performance) describes the fidelity to the
data. If the model also needs to be determined, this set
is further split into two sets. The first set, still called
training set, is used for direct parameter estimation. The
second set is referred to as cross-validation set and is used
to determine the stopping point of the training process.²

•  Prediction set. A certain part of the available data is
strictly kept apart and only used to quote the expected
performance in the future as prediction error or
out-of-sample performance.
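
The average relative variance of Eq. 1 is straightforward to compute. The sketch below assumes NumPy and takes the "mean" in the denominator to be the mean of the targets in the set S being evaluated, which is an assumption about the definition; the function name `arv` simply mirrors the paper's abbreviation.

```python
import numpy as np

def arv(targets, predictions):
    """Average relative variance: squared prediction error on a set,
    normalized by the squared deviation of the targets from their mean."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    numerator = np.sum((targets - predictions) ** 2)
    denominator = np.sum((targets - targets.mean()) ** 2)
    return numerator / denominator

# Illustrative values only.
targets = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
predictions = np.array([1.5, 2.5, 2.0, 4.0, 4.5])
score = arv(targets, predictions)
```

A value of 0 means perfect prediction, while a value of 1 is what results from always predicting the mean of the targets. Rescaling targets and predictions by a common factor leaves the measure unchanged, which is the independence from the dynamic range mentioned in the text.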


¹In this paper, the term variance refers to sums of squared errors.
In the statistics community, there also exists a narrower meaning, as
in bias-variance tradeoff. We use the term variance to denote the
sum of both the squared bias and the variance in the narrower sense.
Incidentally, in the connectionist community, the term bias simply
denotes an additive constant to the input of a unit.

²Our use of the term cross-validation differs from repeated
leave-k-out procedures in that we often pick only one cross-validation set.
Since our emphasis is on the training process, we use the validation
set to monitor the progress during training.


364  A.S. Weigend and D.E. Rumelhart


Ultimately, we are interested in good performance for future
predictions. Can we simply use the performance on the
training set as an estimate of the predictive performance? Do we
really need to set some data apart as a prediction set?

It is well known that the in-sample performance can be a
poor estimate of the out-of-sample performance, particularly
in the presence of noise. For linear regression, it is sometimes
possible to correct for the usually over-optimistic estimate. An
example is to multiply the fitting error by $(N + k)/(N - k)$,
where N is the number of data points and k is the number
of parameters of the model (Akaike, 1970). It is not at all
clear to what degree such approximations hold for nonlinear
models, such as connectionist networks.
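
As a small worked example of the correction factor just quoted, the sketch below applies (N + k)/(N - k) to a fitting error. The function name and the numbers are hypothetical; as the text stresses, the correction is motivated by linear regression and need not hold for nonlinear models.

```python
def corrected_error(fitting_error, n_points, n_params):
    """Scale an in-sample error by (N + k) / (N - k), with N data points
    and k parameters (the correction attributed to Akaike, 1970)."""
    if n_params >= n_points:
        raise ValueError("the correction requires k < N")
    return fitting_error * (n_points + n_params) / (n_points - n_params)

# Illustrative numbers: 100 data points, 10 parameters, fitting error 0.09.
estimate = corrected_error(0.09, n_points=100, n_params=10)
```

With these numbers the factor is 110/90, so the estimate of the out-of-sample error is about 22% larger than the fitting error.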

Now, even if we decided to ignore the issue of nonlinearities
completely, what value should we use for k? Although the
number of available parameters of the network is fixed, the
number of effective parameters increases during training.
Although all parameters are already present at the beginning of
the training process, the number of parameters that are
effective for solving the task is zero since they were just randomly
initialized. We show in Section 3.1.2 how the number of
effective parameters increases during training. This focus on
learning is different from the typical assumption in statistics
that the parameters are fully estimated at the time of model
selection.

Up  to  now,  we  have  ignored  the  question  of  how  to  determine 
the  values  of  the  weights  and  biases.  In  the  next  section,  we 
turn  to  this  question  of  parameter  estimation  and  also  to  the 
problem of model selection in the presence of noise.


2 LEARNING

2.1  BACK-PROPAGATION 

We  use  the  error  back-propagation  algorithm  by  Rumelhart 
et  al.  (1986)  to  train  the  network;  the  parameters  are  changed 
by  gradient  descent  on  the  cost  surface  over  the  weights  and 
biases.  On  the  whole,  the  problem  of  building  a  network  that 
readily  memorizes  a  set  of  training  data  has  proven  easier  than 
expected.  However,  the  problem  of  good  generalization  has 
proven  more  difficult. 

2.2  GENERALIZATION 

Connectionist  networks  are  in  essence  statistical  devices  for 
inductive  inference.  There  is  a  trade-off  between  two  goals. 
On  the  one  hand,  we  want  such  devices  to  be  as  general  as 
possible  so  that  they  can  learn  a  broad  range  of  problems. 
This  recommends  large  and  flexible  networks.  On  the  other 
hand, the true measure of an inductive device is not how well
it performs on the training examples, but how it performs on
cases it has not yet seen, i.e., its out-of-sample performance.

Too many weights of high precision make it easy for a
network to fit the noise of the training data. In this case, when
the network picks out the idiosyncrasies of the training
sample, the generalization to new cases is poor. This overfitting
problem is familiar in inductive inference, such as polynomial
curve fitting. In the extreme, the polynomial fits the training
points exactly and merely interpolates between them.

There are several potential solutions to this problem. We focus
here on the so-called minimal network strategy. The
underlying hypothesis is: if several networks fit the data almost
equally well, the simplest one will on the average provide the
best generalization. Evaluating this hypothesis requires (1)
some way of measuring simplicity, and (2) a search procedure
for finding the desired network.

The complexity of an algorithm can be measured by the length
of its minimal description in some language. The old but
vague intuition of Occam’s razor (or dream) can be
formalized as the Minimum Description Length Criterion: Given
some data, the most probable model is the model that
minimizes the sum

description length(data given model)
+ description length(model).

This sum represents the trade-off between residual error and
model complexity. The goal is to find a network that has
the lowest complexity while fitting the data adequately. The
complexity of a network is dominated by the number of bits
needed to encode the weights. It is roughly proportional to the
number of weights times the number of bits per weight. We
focus here on the procedure of weight-elimination that tries
to find a network with the smallest number of weights.

In  Section  3.1.1,  we  compare  weight-elimination  to  cross- 
validation:  in  that  case,  the  cost  function  only  consists  of  the 
error  term.  Overfitting  is  prevented  by  stopping  the  training 
early,  i.e.,  before  the  error  reaches  its  asymptotic  minimum. 
This  leads  to  a  network  with  fewer  effective  parameters  than 
the  total  number  of  weights  and  biases  (Section  3.1.2). 
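
Early stopping can be sketched as choosing the epoch at which the error on the cross-validation set is lowest. This is one common rule and is an assumption here; the passage above does not spell out the exact stopping criterion, and the function name is hypothetical.

```python
def stopping_epoch(validation_errors):
    """Return the index of the epoch with the lowest validation error
    (the first such epoch if there are ties)."""
    if len(validation_errors) == 0:
        raise ValueError("at least one epoch is required")
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
    return best_epoch

# Illustrative validation errors: they fall, then rise as the
# network begins to fit noise in the training set.
errors = [0.90, 0.50, 0.40, 0.45, 0.60]
chosen = stopping_epoch(errors)
```

In practice the weights saved at the chosen epoch would be the ones used for prediction.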


2.3  WEIGHT-ELIMINATION 

In  1987,  Rumelhart  proposed  several  methods  for  finding 
minimal  networks  within  the  framework  of  back-propagation 
learning.  A  natural  description  of  the  complexity  of  a  network 
uses  quantities  such  as  the  size  of  the  weights,  the  number  of 
connections,  the  number  of  hidden  units,  the  number  of  layers 
of  hidden  units,  or  the  symmetries  of  the  network.  We  focus 
on  the  method  of  weight-elimination  that  considers  the  size 
of  the  weights  and  the  number  of  weights,  and  interpret  the 
complexity  term  as  a  prior  distribution  of  the  weights. 
2.3.1 Method


The idea is indeed simple in conception: add to the error a
term which counts the number of parameters. We are looking
for a differentiable function that is zero for zero weights and
approaches a constant for large weights. We choose

$$\sum_{i \in C} \frac{w_i^2 / w_0^2}{1 + w_i^2 / w_0^2} .$$

$w_0$ is the scale for the weights. The subscript i in $w_i$ simply
enumerates the weights. The sum extends over all
connections C. Note that the biases do not enter the cost function:
all offsets are a priori equally probable. (In the framework
developed below, this corresponds to a non-informative prior:
the probability density for the location parameter is flat.)

The performance term depends on the model for
measurement errors. Since we assume that the errors are Gaussian
distributed, the complete cost function is given by

$$\sum_{k \in T} (\text{target}_k - \text{output}_k)^2 + \lambda \sum_{i \in C} \frac{w_i^2 / w_0^2}{1 + w_i^2 / w_0^2} . \qquad (2)$$

The first term, summed over the set of training examples T,
measures the performance of the network. The second term
measures the size of the network. $\lambda$ represents the relative
importance of the complexity term with respect to the
performance term.


The  learning  rule  is  to  change  the  weights  and  biases  according 
to  the  gradient  of  the  entire  cost  function,  continuously  doing 
justice  to  the  trade-off  between  error  and  complexity.  This 
is  different  from  the  methods  mentioned  in  Section  1.2  that 
consider  a  set  of  fixed  models,  estimate  the  parameters  for 
each  of  them,  and  then  compare  between  the  models. 
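
The complexity term and its gradient, which is what the learning rule adds to the back-propagated error gradient, can be sketched as follows. This is a minimal NumPy illustration of the weight-elimination penalty as described in the text; the function names are hypothetical, and the relative-importance factor is passed in as `lam`.

```python
import numpy as np

def complexity_cost(w, w0=1.0):
    """Sum over weights of (w_i^2 / w0^2) / (1 + w_i^2 / w0^2)."""
    r = (np.asarray(w, dtype=float) / w0) ** 2
    return np.sum(r / (1.0 + r))

def complexity_grad(w, w0=1.0):
    """Derivative of the complexity cost with respect to each weight:
    (2 w_i / w0^2) / (1 + w_i^2 / w0^2)^2."""
    w = np.asarray(w, dtype=float)
    r = (w / w0) ** 2
    return (2.0 * w / w0 ** 2) / (1.0 + r) ** 2

def total_cost(targets, outputs, w, lam, w0=1.0):
    """Squared error over the training set plus lam times the complexity."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.sum((targets - outputs) ** 2) + lam * complexity_cost(w, w0)

# Illustrative weights: one negligible, one at the scale w0, one large.
weights = np.array([0.01, 1.0, 100.0])
per_weight = [complexity_cost([w]) for w in weights]
```

The per-weight costs are close to 0, exactly 0.5, and close to 1 respectively, which is why the term behaves like a counter of significantly sized weights. Biases are left out of `w`, following the text.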


Figure 2: Complexity cost (in units of $\lambda$) of a weight as function of
the size of the weight (in units of $w_0$).

The complexity cost is shown in Fig. 2 as function of $|w_i|/w_0$.
For $|w_i| \gg w_0$, the cost of a weight approaches unity (times
$\lambda$). This justifies the interpretation of the complexity term as
a counter of significantly sized weights. For $|w_i| \ll w_0$, the
cost is close to zero. “Large” and “small” are defined with
respect to the scale $w_0$. It is a free parameter of the
weight-elimination procedure. In our experience, choosing $w_0$ of
order unity is good for activations of order unity. The effects
of the choice for $w_0$ are discussed further in Section 2.3.3.

$\lambda$ is dynamically adjusted in training. This dynamic increase,
described in detail in Weigend et al. (1991), is related to the
concept of iterated training as opposed to one-shot parameter
estimation. At the beginning of the training, the weights are
not useful yet, since they were just initialized randomly. Any
significant cost for complexity would devour the whole
network. Hence, $\lambda$ starts at zero. The usual subsequent increase
corresponds to attaching more importance to the complexity
term or, from the perspective developed in the next section,
to sharpening the peak around zero of the prior distribution of
the probability density function of the weights.

2.3.2  Interpretation  as  Prior  Probability 

In  a  Bayesian  framework,  the  complexity  cost  can  be  viewed 
as  the  negative  logarithm  of  the  prior  probability  of  a  weight. 


Figure 3: Prior probability of a weight as function of the size of the
weight (in units of $w_0$), plotted for different values of $\lambda$.


In Fig. 3, we show the prior probability density function from
which single weights of size $w_i$ are drawn,

$$\text{prior} \propto \exp\left(-\lambda \, \frac{w_i^2 / w_0^2}{1 + w_i^2 / w_0^2}\right).$$

It is a mixture of a flat distribution and a bump around zero.
Relevant weights are drawn from the flat distribution. Weights
that are merely the result of noise are drawn from the bump
centered on zero; they are expected to be small.

So far, we have only described our choice of the prior for a
single weight. How do we get to the whole network?
Assuming that the weights can be treated as independent, we simply
sum over the connections in Equation 2.


2.3.3 Ridge Regression as Special Case

We here discuss the relationship of our method of
weight-elimination to weight-decay, proposed by Hinton and by
Le Cun in 1987. In weight-decay, a small percentage of
the weight is subtracted at each weight update,

$$\Delta w_i = (\text{weight change due to error back-propagation}) - \alpha\, w_i .$$

This can be viewed as an exponential decay of the weight. It
corresponds to a quadratic complexity cost ($\propto w_i^2$), known in
the statistics community as ridge regression. It is contained
in the weight-elimination scheme as the special case of large
$w_0$. Weight-decay always prefers networks with many small
weights. Weight-elimination prefers few large weights over
many medium sized weights in the region where it acts as
a counter. The scale parameter $w_0$ allows us to express a
preference for many small weights ($w_0$ large) versus a few
large weights ($w_0$ small). Depending on the dynamic range
and the number of the units of the preceding layer, $w_0$ might
be given different values for different layers of the network.
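
The two regimes can be checked numerically. The sketch below (all names
and numerical values are illustrative assumptions) shows that repeated
weight-decay updates without an error gradient shrink a weight
geometrically, and that the weight-elimination cost is close to the
quadratic (ridge) cost for weights much smaller than the scale w0 and
saturates near 1, so that it merely counts the weight, for weights much
larger than w0.

```python
def decay_steps(w, a, n):
    """Apply n weight-decay updates w <- w - a*w with zero error gradient."""
    for _ in range(n):
        w = w - a * w
    return w

def elimination_cost(w, w0):
    """Weight-elimination cost per weight (without the factor lambda)."""
    r2 = (w / w0) ** 2
    return r2 / (1.0 + r2)

# Geometric (discrete exponential) decay: w_n = w_start * (1 - a)^n
w_after = decay_steps(1.0, a=0.01, n=100)
closed_form = (1.0 - 0.01) ** 100

# Small-weight regime: the cost is close to the quadratic (w/w0)^2
small = elimination_cost(0.01, w0=1.0)
# Large-weight regime: the cost is close to 1, i.e. it counts the weight
large = elimination_cost(100.0, w0=1.0)
```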

Expressing  the  cost  of  a  weight  as  a  prior  can  make  it  easier 
to  interpret  distributions  that  are  not  intuitive  when  viewed  as 
penalty  costs.  Nowlan  (1991)  proposes  a  mixture  of  a  few 
Gaussians  as  prior.  This  prior  assumes  that  networks  with 
weights  around  a  few  centers  are  more  likely  than  networks 
with  weights  of  many  different  values. 

We now apply these methods to time series prediction.

3 SUNSPOTS

The sunspot series has served as a benchmark in the statistics
literature. Within the paradigm of autoregression, different
models differ in the specific choice of the primitives for the
surface above the input space. In the simplest case, a single
hyperplane approximates the data points. Such a linear
autoregressive model is a linear superposition of past values.
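
A linear autoregressive model of this kind can be fitted by ordinary
least squares. The following is a minimal sketch, assuming NumPy is
available; the helper names are hypothetical and the series used for the
check is synthetic, not the sunspot record.

```python
import numpy as np

def lagged_matrix(x, d):
    """Rows are (x[t-1], ..., x[t-d]); targets are x[t], for t = d..len(x)-1."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[d - k: len(x) - k] for k in range(1, d + 1)])
    y = x[d:]
    return X, y

def fit_linear_ar(x, d):
    """Least-squares fit of x[t] ~ c + sum_k a[k] * x[t-k]."""
    X, y = lagged_matrix(x, d)
    A = np.column_stack([np.ones(len(y)), X])   # constant offset plus d lags
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef                                  # coef[0] = c, coef[1:] = a

# Synthetic check series (not sunspot data): an exact AR(2) recursion
x = [1.0, 0.5]
for _ in range(50):
    x.append(0.3 + 0.6 * x[-1] - 0.2 * x[-2])
coef = fit_linear_ar(x, d=2)
```

On this noise-free series the fit recovers the offset and the two lag
coefficients used to generate it.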

The evaluation of the network model, however, is carried
out by comparison to a nonlinear model, the threshold
autoregressive model (TAR) by Tong and Lim (1980); see also
Tong (1990). It has served as a benchmark for Subba Rao
and Gabr (1984), for Priestley (1988), for Lewis and Stevens
(1991), for Stokbro (1991), and for others.
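
To make the idea of a threshold autoregressive model concrete, the
sketch below implements a one-step predictor that switches between two
linear autoregressive regimes, assuming that the regime is selected by
comparing one lagged value with a threshold. The coefficients, the delay
and the threshold are hypothetical placeholders, not the values fitted
by Tong and Lim.

```python
def tar_predict(history, low, high, threshold, delay):
    """One-step prediction from a two-regime threshold AR model.

    history: past values, most recent last.
    low, high: (offset, [a1, a2, ...]) for each regime, where a_k
               multiplies the value k steps back.
    The regime is selected by comparing history[-delay] to threshold.
    """
    offset, coeffs = low if history[-delay] <= threshold else high
    return offset + sum(a * history[-k] for k, a in enumerate(coeffs, start=1))

# Hypothetical parameters, for illustration only
low_regime = (1.0, [0.5])          # x_t = 1.0 + 0.5 * x_{t-1}
high_regime = (4.0, [0.9, -0.3])   # x_t = 4.0 + 0.9 * x_{t-1} - 0.3 * x_{t-2}

p_low = tar_predict([2.0, 3.0], low_regime, high_regime, threshold=5.0, delay=1)
p_high = tar_predict([6.0, 8.0], low_regime, high_regime, threshold=5.0, delay=1)
```

Each regime is linear, but the switch between them makes the overall map
from past values to the prediction nonlinear.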


The TAR model is globally nonlinear: it consists of two local
linear autoregressive models. Tong and Lim found optimal
performance for input dimension d = 12. They used yearly
sunspot data from 1700 through 1920 for training, and the
data from 1921 to 1979 to evaluate the predictions.

To make the comparison between network and TAR performance
as close as possible, we use their exact data for training
and evaluation, their choice for the input dimension, their
error model and their evaluation criterion. The only remaining
difference is the choice of the primitives for the surface.

3.1 LEARNING THE SERIES

In this section we analyze the in-sample learning behavior
of the networks; first with a cross-validation set (needed to
determine a stopping point when there is no complexity term
in the cost function), then with weight-elimination. The
out-of-sample performance will be analyzed in Section 3.2.

3.1.1 Internal Validation (Early Stopping)

The learning of the sunspot series by a 12-8-1 network is
shown in Fig. 4 as a function of epochs. An epoch is one
iteration of gradient descent in which the network sees each
point from the training set once. Training with standard
back-propagation (no weight-elimination) is displayed in the left
panel. (The panel on the right hand side is discussed in
Section 3.1.3.)

Figure 4: Learning curves of a 12-8-1 network (horizontal axes:
epochs). The average relative single-step prediction variances are
given for the training sets, and early and late prediction sets (as
well as for the cross-validation set for the network trained without
weight-elimination on the left side). The vertical lines (A, B, a, b)
indicate different stopping points. The average relative variance is
normalized by the variance of the entire record (1535).


The success in mastering the training set is indicated by the
monotonic decrease of the lowest curve, indicating the
in-sample performance (or fitting error). To get a feeling for the
non-stationarity of the time series, the prediction set was split
in two parts, 1921-1955 and 1956-1979. On both prediction
sets, the error first decreases, but then starts to increase: the
network begins to use its resources to fit the noise of the
training set. It starts to pick out properties that are specific to
the training set, but not present in the prediction sets. This is
an indication of overfitting.

When should the training be stopped? Since prediction
sets should not be used for this decision, a validation set is
required to determine the end of the training process. To get
a feeling for the effect of the sampling error by picking a specific
training set-validation set combination, we investigated
several training set-validation set pairs.

The  validation  sets  consisted  of  22  years  chosen  at  random 
from  the  time  before  1920.  Those  points  were  removed  in  the 
corresponding  training  sets,  reducing  their  size  by  10%.  The 
variations  in  performance  due  to  different  pairs  of  training  and 
validation  sets  are  larger  than  the  variations  due  to  different 
sets  of  random  initial  weights.^  In  the  example  given  in 
Fig.  4,  the  validation  set  error  approaches  an  asymptotic  value. 
Since  it  does  not  increase,  it  is  not  entirely  clear  which  set  of 
weights  should  be  taken.  We  thus  compare  in  Section  3.2  the 
performance  for  two  stopping  points,  A  and  B. 
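
The mechanics of holding out a random validation set and monitoring it
during training can be sketched as follows. This is only an
illustration under stated assumptions: a linear model trained by batch
gradient descent stands in for the 12-8-1 network, the data are
synthetic rather than the sunspot series, and the number of patterns and
the learning rate are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the sunspot series): 209 patterns, 12 inputs
X = rng.normal(size=(209, 12))
y = X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=209)

# Hold out 22 randomly chosen patterns as the validation set
idx = rng.permutation(len(y))
val_idx, train_idx = idx[:22], idx[22:]

def mse(w, rows):
    """Mean squared error of the linear model w on the given rows."""
    return float(np.mean((X[rows] @ w - y[rows]) ** 2))

w = np.zeros(12)
best_w, best_val, best_epoch = w.copy(), mse(w, val_idx), 0
for epoch in range(1, 501):
    # One epoch of batch gradient descent on the training patterns only
    resid = X[train_idx] @ w - y[train_idx]
    grad = 2.0 * X[train_idx].T @ resid / len(train_idx)
    w = w - 0.03 * grad
    val = mse(w, val_idx)
    if val < best_val:     # remember the weights with the lowest validation error
        best_w, best_val, best_epoch = w.copy(), val, epoch
```

The weights kept in best_w are those of the stopping point; rerunning
with a different seed changes the split and, in general, the stopping
point, which is the sensitivity discussed in the text.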

Some of the problems with early stopping through
cross-validation are that (1) a part of the available training data
cannot be used directly for parameter estimation, (2) the
monitored validation set error often shows multiple minima as a
function of training time (even in the simple linear case
analyzed by Baldi and Chauvin, 1991), (3) the specific solution
at the stopping points depends strongly on the specific pair of
training set and validation set, and (4) the results are sensitive
to the initial parameters.

Before  comparing  cross-validation  with  weight-elimination, 
we turn in the next section to the question of how the effective
number of parameters changes with training. We first focus
on  the  activations  of  the  hidden  units,  then  on  the  weights 
between  inputs  and  hidden  units. 

3.1.2  Effective  Dimension  of  Hidden  Units 

Still within the framework of standard back-propagation, we
analyze the change of the effective dimension of the hidden
unit space during training by computing the spectrum of the
eigenvalues of the covariance matrix of the hidden unit
activations. The covariance $C_{ij}$ corresponds to the two-point
correlation between the activations of the two hidden units $i$
and $j$, computed over the training set,

$$C_{ij} = E\left[ (S_i - \bar{S}_i)(S_j - \bar{S}_j) \right] ,$$

where $\bar{S}_i = E[S_i]$ is the mean activation of hidden unit $i$,
taken over the set of training points. Since the covariance
matrix is symmetric, $C_{ij} = C_{ji}$, its eigenvalues are real.

Linear correlation is appropriate, since the output linearly
combines the hidden unit activations. The number of
significantly sized eigenvalues is a measure of the effective
dimension of the hidden unit space. It can be viewed as the effective
rank of the covariance matrix. For linear networks, Baldi and
Hornik (1992) use similar concepts.
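
The eigenvalue spectrum of the activation covariance can be computed
as in the sketch below. It assumes NumPy; the function names, the
relative threshold used to call an eigenvalue "significantly sized",
and the synthetic activations (eight units driven by three independent
signals) are assumptions made for illustration.

```python
import numpy as np

def activation_spectrum(S):
    """Square roots of the eigenvalues of the covariance of activations.

    S has one row per training pattern and one column per hidden unit.
    """
    C = np.cov(S, rowvar=False, bias=True)   # C_ij = E[(S_i - mean_i)(S_j - mean_j)]
    eigvals = np.linalg.eigvalsh(C)[::-1]    # symmetric matrix: real eigenvalues, descending
    return np.sqrt(np.clip(eigvals, 0.0, None))

def effective_dimension(S, rel_tol=0.05):
    """Count standard deviations above rel_tol times the largest one."""
    sd = activation_spectrum(S)
    return int(np.sum(sd > rel_tol * sd[0]))

# Synthetic activations: 8 hidden units driven by only 3 independent signals
rng = np.random.default_rng(1)
signals = rng.normal(size=(209, 3))
mixing = np.array([[1, 1, 1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1, 0, 0],
                   [0, 0, 0, 0, 0, 0, 1, 1]], dtype=float)
S = signals @ mixing + 1e-3 * rng.normal(size=(209, 8))
dim = effective_dimension(S)
```

Although eight units are present, only three directions carry
appreciable variance, so the effective dimension comes out as three.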

^We chose the years for the validation sets randomly. An
improvement might be to only consider random splits where the first
and second moments (mean and variance) of the validation set match
the training set. Another idea is to first train, stop and save several
networks on different training-validation pairs, and then combine
their individual predictions. The combination is done by freezing
the weights and biases of the sub-nets, and only letting the few new
combination weights adapt to the entire training set.

Figure 5: Eigenvalue spectrum of the covariance matrix of the
hidden unit activations. The double line represents the fourth largest
eigenvalue.

Fig. 5 shows the eigenvalue spectrum as a function of training
time.^ The eigenvalues correspond to the variances captured
by the corresponding eigenvectors. In the figure, we plot the
square root of the eigenvalues. They correspond to the standard
deviations "explained" by the corresponding principal
components. The figure shows that gradient descent extracts
one component after another. This provides some justification
for the whole strategy of oversized networks and early
stopping: the dimension of the hidden unit space starts essentially
at zero and then increases in training. The goal is to stop
at just the right dimension.

So far, we have focused on eigenvalues derived from hidden
unit activations. We now turn to eigenvalues derived from
weights. We analyze the singular value decomposition of
the weight matrix between inputs and hidden units. We
decompose the 12 × 8 weight matrix (inputs × hidden units)
into two orthogonal matrices and one diagonal matrix and
display the square root of the eigenvalues of that diagonal matrix
in Fig. 6. At the beginning of the training, the eigenvalues just
reflect the initialization of the weights.^ As training proceeds,
the dimension spanned by the weight space increases.
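
For reference, a singular value decomposition of a 12 × 8 matrix can be
obtained as sketched below, assuming NumPy. The matrix here is a
hypothetical stand-in drawn uniformly from [-0.03, 0.03], not the
trained network; the sketch also checks the standard identity that the
singular values of W are the square roots of the eigenvalues of W^T W.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 12 x 8 input-to-hidden weight matrix at initialization
W = rng.uniform(-0.03, 0.03, size=(12, 8))

# W = U @ diag(s) @ Vt, with orthonormal columns in U and rows in Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# The singular values are the square roots of the eigenvalues of W^T W
eig = np.linalg.eigvalsh(W.T @ W)[::-1]
s_from_eig = np.sqrt(np.clip(eig, 0.0, None))
```

Recording s at regular intervals during training would give curves of
the kind the text describes.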

Both Fig. 5 and Fig. 6 only contain information from the
training set. We now compare this information with the
performance on the prediction set. In the run used for the
eigenvalue calculations, the out-of-sample error reached its
minimum around epoch 1000. At that point in training, the hidden
units span an effectively three dimensional space. Extracting
the fourth and subsequent eigenvalues hence corresponds to
overfitting.

^The activations of the eight hidden units for each of the 209
points of the training set were recorded after every 50 epochs of
training with learning rate 0.03. The overshooting of the largest
principal component disappears if the hidden unit activations are
multiplied with their corresponding output weights prior to
computing the covariance matrix.

^We started the training with weights drawn from a uniform
distribution over the interval [-0.03, 0.03], corresponding to almost
linear hidden units.

Figure 6: Eigenvalue spectrum of the singular value decomposed
matrix of weights between input and hidden units.

In the next section, we turn to training with weight-elimination.
Interestingly, weight-elimination yields networks with three
hidden units. This agreement between the effective dimension
of hidden units at the onset of overfitting and the number
of hidden units after weight-elimination is encouraging.

3.1.3 Weight-elimination

As  in  back-propagation  without  weight-elimination,  we  start 
with  a  network  sufficiently  large  for  the  task.  The  training 
curve  for  back-propagation  with  weight-elimination  is  shown 
in  the  right  panel  of  Fig.  4.  Significant  overfitting  is  avoided, 
even  for  training  times  four  times  as  long.  Since  the  entire 
training  set  is  used,  we  are  relieved  from  the  uncertainty  of  a 
specific  choice  for  a  validation  set.  But  we  still  have  to  decide 
when  the  asymptotic  state  is  reached.  The  performance  of 
two  solutions  (a  and  b)  is  compared  in  Section  3.2.  It  turns 
out  that  the  exact  stopping  point  is  not  important.  In  the  first 
5000  epochs,  the  procedure  eliminated  the  weights  between 
the  output  unit  and  five  of  the  eight  hidden  units.  Only  three 
hidden  units  survived. 


Weights from inputs to dead hidden units have no effect on
the output. Since there is no reason for the network to pay a
price for these weights, they subsequently also get eliminated.
For time series prediction, weight-elimination acts as hidden
unit elimination.

We analyzed the specific solution of the network that was
stopped at point b and subsequently trained with a very small
learning rate for a few epochs. The main contribution to the
first hidden unit comes from $x_{t-9}$, to the second hidden unit
from $x_{t-2}$, and to the third hidden unit from $x_{t-1}$. In contrast
to the output weights, only very few of the weights from
the input units to the active hidden units disappeared. (The
parameters of the network are given in Weigend et al., 1990.)

Predictions are obtained by adding the values of these three
hidden units. The main encoding is performed by the nonlinear
projection from the twelve dimensional input space onto
the three dimensional hidden unit space.

3.2 PREDICTIONS AND COMPARISONS

So far, we have concentrated on the learning behavior of the
network. Just obtaining a small network, however, is not an
end in itself: the ultimate goal is to predict future values. In
this section, we assess the predictive power of the network and
compare it to other approaches. We first analyze single-step
predictions and then turn to multi-step predictions.

3.2.1 Single-Step Prediction

The term single-step prediction (or one-step-ahead prediction)
is used when all input units are given the actual values
of the time series (as opposed to the predicted values). To
assess the single-step prediction performance, we use the
relative mean squared error (or average relative variance, arv),
defined in Equation 1.

The weight-eliminated network gives

$$\mathrm{arv}(\text{train}) = 0.082 , \qquad \mathrm{arv}(\text{predict})_{1921\text{-}1955} = 0.086 .$$

The corresponding values for the TAR model are

$$\mathrm{arv}(\text{train}) = 0.097 , \qquad \mathrm{arv}(\text{predict})_{1921\text{-}1955} = 0.097 .$$

Comparing these numbers, we see that the single-step predictions
of the network and the benchmark model are comparable.
Despite this similarity, significant differences will appear for
predictions further than one step into the future.

3.2.2 Multi-Step Prediction

There are two ways to predict further than one step into the
future. We first present the results of iterated single-step
predictions and subsequently turn to direct multi-step predictions.
Most of the analysis so far applies to regression in general.
Iterated predictions, however, are specific to time series.


In iterated single-step predictions, the predicted output is
fed back as input for the next prediction and all other input
units are shifted back one unit. Hence, the inputs consist of
predicted values as opposed to observations of the original
time series. The predicted value for time $t$, obtained after $I$
iterations, is denoted by $\hat{x}_{t,I}$.

The prediction error will not only depend on $I$ but also on
the time $(t - I)$ when the iteration was started. We wish to
obtain a performance measure as a function of the number
of iterations $I$ that averages over the starting times. Since
we want to fully exploit the standard prediction set range for
the sunspot data from $t_{\rm BEGIN} = 1921$ to $t_{\rm END} = 1955$, we
compute for each $I$ the average

$$\frac{1}{\sigma^2}\; \frac{1}{t_{\rm END} - (t_{\rm BEGIN} - 1 + I)} \sum_{t = t_{\rm BEGIN} - 1 + I}^{t_{\rm END}} \left( x_t - \hat{x}_{t,I} \right)^2 ,$$

where $\sigma^2$ is the variance of the entire record.

This (average relative) prediction variance after $I$ iterations
is shown in Fig. 7. Only to indicate the spread of network
performances, we give several network solutions. The letters
A, B, a, b refer to the different stopping points shown in Fig. 4.
The differences between the different network solutions are
not significant.
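
The procedure of iterating a single-step predictor and averaging the
squared error over starting times can be sketched as follows. The
function names are hypothetical, the predictor is passed in as an
arbitrary callable, and the toy series and rule are chosen only so that
the result can be verified by hand; the reference variance used for
normalization is left as a parameter.

```python
import numpy as np

def iterate_prediction(predict, window, n_iter):
    """Feed each prediction back as the newest input; return the n_iter-th one.

    predict: maps an input window (oldest first) to the next value.
    window: the last d observed values before the iteration starts.
    """
    window = list(window)
    for _ in range(n_iter):
        nxt = predict(window)
        window = window[1:] + [nxt]   # shift the inputs by one, append the prediction
    return nxt

def iterated_error(series, predict, d, n_iter, first, last, variance):
    """Squared error after n_iter iterations, averaged over start times
    and divided by a reference variance."""
    errs = []
    for t in range(first, last + 1):
        start = t - n_iter + 1        # index of the first predicted point
        if start - d < 0:
            continue                  # not enough history for this target
        window = series[start - d:start]
        errs.append((series[t] - iterate_prediction(predict, window, n_iter)) ** 2)
    return float(np.mean(errs)) / variance

# Toy check: the series x_t = t is predicted exactly by x_t = 2*x_{t-1} - x_{t-2}
series = [float(t) for t in range(40)]
linear_rule = lambda w: 2.0 * w[-1] - w[-2]
err = iterated_error(series, linear_rule, d=2, n_iter=5, first=20, last=39, variance=1.0)
```

For a real predictor the error generally grows with the number of
iterations, since each step consumes earlier predictions rather than
observations.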


Figure 7: Relative prediction error after I iterations for the sunspot
series. Gray T's give the performance of the TAR model. Black
squares show the performance of the weight-eliminated network
with three hidden units. The other curves indicate the performance
of the network solutions from Fig. 4.

An alternative to this multi-step prediction by iterated
single-step prediction is direct multi-step prediction: the network is
trained to predict directly several steps ahead. On the sunspot
data set, the prediction error for direct multi-step prediction
was worse than the error for iterated single-step prediction.
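
The difference between the two schemes lies only in how the training
pairs are formed. A minimal sketch of the pairs for direct multi-step
prediction, assuming NumPy and a hypothetical helper name, is:

```python
import numpy as np

def direct_pairs(x, d, horizon):
    """Input windows of length d and targets `horizon` steps after the window ends."""
    x = np.asarray(x, dtype=float)
    n = len(x) - d - horizon + 1
    X = np.stack([x[i:i + d] for i in range(n)])
    y = x[d + horizon - 1: d + horizon - 1 + n]
    return X, y

# Example with a short ramp: windows of 3 values, target 2 steps ahead
X, y = direct_pairs(np.arange(10.0), d=3, horizon=2)
```

With horizon=1 the same function yields the usual single-step pairs, so
a separate model would be trained for each horizon of interest.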

In summary, although we took extreme care not to gain any
unfair advantage over Tong and Lim (1980) (by taking the
same input dimension, using identical data sets, minimizing
the same sum of squared errors, etc.), the multi-step predictions
were found to be significantly better: on average, the
iterated prediction variances of the network were about half
the iterated prediction variances of the TAR model. This
concludes the comparison with the benchmark model.^

Subba Rao and Gabr (1984) apply a bilinear model^ to the
sunspot data and find an improvement of about 15% over the
TAR model, both for single-step and iterated predictions. On
predictions further than one step into the future, the networks
outperform the bilinear model on average by 35% in mean
squared error.

Stokbro (1991) uses a weighted linear predictor (WLP). In
a WLP, each primitive is the product of a first order polynomial
and a normalized Gaussian radial basis function. The
predictor is the linear superposition of these primitives. Stokbro
compares WLP with the network solution on the
1921 to 1946 prediction set given in Weigend et al. (1990).
For one and two iterations, both methods have similar errors.
When iterated more than twice, the network outperforms the
WLP model.

Recently, Lewis and Stevens (1991) applied multivariate adaptive
regression splines (MARS) by Friedman (1991) to the
sunspot series. We find that the performance of MARS is
very similar to the performance of the network. Given that
the primitives of both schemes (sigmoids and splines) are
smooth, and given that both approaches employ a regularization
scheme that penalizes complexity, the similar performance
is not astonishing but rather encouraging.

3.3  VARYING  THE  INPUT  DIMENSION 

Up  to  now,  all  predictions  were  based  on  information  of  the 
preceding  twelve  years.  What  happens  if  we  vary  the  input 
dimension?  When  the  number  of  input  units  is  reduced,  we 
expect  the  error  to  increase,  at  least  at  some  stage.  But  when 
the  number  of  input  units  is  increased,  two  effects  compete. 
On the one hand, more information becomes available, possibly
allowing for better predictions. On the other hand, the
higher  the  input  dimension,  the  more  sparsely  distributed  the 
training  data.  Will  the  networks  be  robust  if  more  input  units 
than  necessary  are  present? 

^The discrepancy between a negligible difference in single-step
prediction accuracy and a factor of two for iterated predictions is
interesting. A conjecture (from a discussion with Jerry Friedman) is
the following:

Consider the single-step squared error decomposed into a squared
bias and a variance. In this footnote (in contrast to the rest of the
paper) the term variance refers to the spread of network solutions,
see Geman, Bienenstock and Doursat (1991). Since a network is a
more flexible model than a TAR model, the bias of the network is
smaller than the bias of TAR. If iterating amplifies the squared bias
more than the variance, the observed effect is explained.

^In addition to linear autoregression (terms proportional to $x_{t-i}$),
Subba Rao and Gabr allow terms proportional to the forecasting
errors $\epsilon_{t-j}$ as well as terms proportional to the product $x_{t-k}\,\epsilon_{t-l}$
(bilinear interactions). In the framework of connectionist networks,
arbitrary interactions between lagged inputs $x_{t-k}$ and past prediction
errors $\epsilon_{t-l}$ are modeled by enhancing the usual input with a set of
units representing $\epsilon_{t-l}$. Such a network can learn to extract possibly
nonlinear responses to outside shocks.

Figure 8: Prediction error for sunspots as a function of the number of
inputs and forecasting time.


The prediction error for iterated single-step predictions (for
the 1921-1955 set) is shown in Fig. 8 as a surface above the
number of inputs and the prediction time into the future.

Networks with one input unit of lag one already manage to
capture two thirds of the single-step variance, reducing it
to 0.33. The solution is practically linear with an offset.
Networks with two input units reduce the relative mean squared
error to 0.17. They begin to use the available nonlinearities.
With increasing number of input units, the error reaches a
roughly constant value. The performance does not degrade
with input dimension several times larger than necessary: the
networks ignore irrelevant information.


To summarize, we use a procedure called weight-elimination
that addresses the related problems of network size and
overfitting by dynamically eliminating weights during training. In
this paper, we focus on the time series of yearly sunspot
averages from the year 1700 onward. On iterated predictions
into the future, the network performance turns out to be very
similar to MARS and significantly better than other models.

We close with two references to further examples of networks
for time series prediction. In Weigend et al. (1990), we
analyze a time series from a computational ecosystem and show
that connectionist networks can predict the utilization of the
resources of the ecosystem for hundreds of steps into the
future. And in Weigend et al. (1991), we apply networks to
the prediction of the notoriously noisy foreign exchange rates
and show that the key to a solution there is selection of the
relevant variables through weight-elimination.

We  thank  Jerry  Friedman  and  Art  Owen  for  discussions. 


4 REFERENCES

Akaike, Hirotugu. Statistical predictor identification. Ann.
Institute of Statistical Mathematics, 22:203-217, 1970.

Baldi, Pierre and Chauvin, Yves. Temporal evolution of
generalization during learning in linear networks.
Submitted to Neural Computation, 1991.

Baldi, Pierre and Hornik, Kurt. Back-propagation and
unsupervised learning in linear networks. In Chauvin, Y.
and Rumelhart, D. E., editors, Backpropagation and
Connectionist Theory. Lawrence Erlbaum, 1992.

Friedman, Jerome H. Multivariate adaptive regression
splines. The Annals of Statistics, 19:1-141 (with
discussion), 1991.

Geman, Stuart, Bienenstock, Elie, and Doursat, René. Neural
networks and the bias/variance dilemma. Submitted
to Neural Computation, 1991.

Lapedes, Alan S. and Farber, Robert M. Nonlinear signal
processing using neural networks: prediction and
system modelling. Technical Report LA-UR-87-2662,
Los Alamos National Laboratory, 1987.

Lewis, Peter A. W. and Stevens, J. G. Nonlinear modeling
of time series using multivariate adaptive regression
splines (MARS). Submitted to Journal of the American
Statistical Association, 1991.

Nowlan, Steven J. Soft Competitive Adaptation: Neural
Network Learning Algorithms based on Fitting Statistical
Mixtures. PhD thesis, CMU (Computer Science), 1991.

Priestley, Maurice B. Non-linear and Non-stationary Time
Series Analysis. Academic Press, 1988.

Rumelhart, David E., Hinton, Geoffrey E., and Williams,
Ronald J. Learning internal representations by error
propagation. In Rumelhart, D. E. and McClelland, J. L.,
editors, Parallel Distributed Processing, pages 318-362.
MIT Press, 1986.

Subba Rao, T. and Gabr, M. M. An Introduction to Bispectral
Analysis and Bilinear Time Series Models, volume 24 of
Lecture Notes in Statistics. Springer, 1984.

Stokbro, Kurt. Predicting chaos with weighted maps.
Technical Report 91/10 S, Nordita, Copenhagen, 1991.

Tong, Howell and Lim, K. S. Threshold autoregression,
limit cycles and cyclical data. Journal Royal Statistical
Society B, 42:245-292, 1980.

Tong, Howell. Non-linear Time Series: a Dynamical System
Approach. Oxford University Press, 1990.

Weigend, Andreas S., Huberman, Bernardo A., and Rumelhart,
David E. Predicting the future: a connectionist
approach. International Journal of Neural Systems,
1:193-209, 1990.

Weigend, Andreas S., Huberman, Bernardo A., and Rumelhart,
David E. Predicting sunspots and exchange rates
with connectionist networks. In Casdagli, M. and Eubank,
S. G., editors, Nonlinear Modeling and Forecasting.
Addison-Wesley, 1991.



Probabilities  on  Complex  Pedigrees; 
the  Gibbs  Sampler  Approach 

Elizabeth  Thompson* 

Department  of  Statistics,  GN-22, 

University  of  Washington 
Seattle,  WA  98195 


Abstract 

The analysis of complex familial traits requires the computation
of likelihoods for complex genetic models on
extended  and/or  complex  pedigrees.  This  challenge  has 
defeated  conventional  computational  algorithms,  but 
the  pedigree  Gibbs  sampler  provides  an  effective  method 
of  Monte  Carlo  evaluation  of  the  required  probabilities 
and  likelihood  ratio  functions. 

Key Words: Genetic models; Complex pedigrees;
Conditional independence structure; Monte Carlo summation;
Importance sampling; Gibbs sampler.

1  Introduction 

The objective is to compute the probability of trait data
observed on some subset of the related members of a
specified pedigree structure, or the probability of underlying
genotypic configurations on the pedigree conditional
upon trait data, in either case under some specified
genetic model for the trait. Often in the analysis
of complex familial traits the genetic models required
will be complex, with several genetic and non-genetic
factors contributing to the observed trait. On the other
hand, the pedigrees on which such traits are analysed
are not necessarily complex, even when extended pedigrees
of several hundred individuals are used to help
to ensure genetic homogeneity of the trait in question.
However, the pedigrees of genetically isolated populations
are complex, and in this paper we shall address the
question of likelihood computations on complex pedigrees.
By contrast, we shall restrict attention to simple
genetic models, mainly for expository convenience.

For computational purposes a pedigree is most easily
specified by giving, for every individual, the unique individual
identifiers of his/her mother and father. Graphically,
a complex pedigree is represented most easily as a
marriage node graph: an arc joins each individual to his
parents' marriage (or mating) and to each of his own
marriages. Figure 1 shows a small example pedigree
in marriage node graph form. The pedigrees on which
we require probabilities and likelihoods are far larger
and more complex than that of figure 1: one example
is discussed in section 6. A general characteristic of the
extended complex multi-generation pedigrees of genetic
isolates is that data are available normally only for current
individuals, the lower fringe of the pedigree, and
often not even for all such individuals. Although problems
of graph-theoretic type arise in the development of
algorithms for computations on pedigrees, pedigrees are
not arbitrary graphs. Chronology and marriage preferences
often result in substantial ordering and regularity
(figure 2). Note that throughout this paper it is assumed
that the pedigree itself is known correctly.

Figure 1: An example pedigree; from Thompson (1986).
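
As an illustration of this parent-identifier representation, the sketch
below stores a small pedigree as a mapping from each individual to the
pair (mother, father) and derives the founders, the marriage nodes and
an ordering in which parents precede their children. The identifiers,
the pedigree itself and the function names are made up for the example.

```python
# Each individual maps to (mother, father); founders have no recorded parents.
pedigree = {
    "f1": (None, None), "m1": (None, None), "m2": (None, None),
    "c1": ("m1", "f1"),
    "c2": ("m1", "f1"),
    "g1": ("m2", "c1"),
}

def founders(ped):
    """Individuals whose parents are not pedigree members."""
    return sorted(i for i, (mo, fa) in ped.items() if mo is None and fa is None)

def marriages(ped):
    """Distinct (mother, father) pairs: the marriage nodes of the graph."""
    return sorted({(mo, fa) for mo, fa in ped.values() if mo is not None})

def parents_first_order(ped):
    """Order individuals so that every parent precedes its children."""
    order, placed = [], set()
    while len(order) < len(ped):
        ready = [i for i, (mo, fa) in ped.items()
                 if i not in placed and all(p is None or p in placed for p in (mo, fa))]
        if not ready:
            raise ValueError("pedigree contains a cycle or an unknown parent")
        for i in sorted(ready):
            order.append(i)
            placed.add(i)
    return order

order = parents_first_order(pedigree)
```

The ordering reflects the chronology mentioned in the text: a pedigree
is a directed graph without cycles, since nobody can be their own
ancestor.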

There are two main classes of purpose in computing
the probability of trait data observed on a specified pedigree
structure under a specified genetic model. The first
is where the genetic model for the trait is known. In
this case, of interest are genetic counselling probabilities;
probabilities that specified individuals carry genes
that cause them to be at risk of developing a certain
disease, or, jointly with a specified partner, have offspring
at risk of developing the disease. Related to these
probabilities are ancestral inference probabilities;
posterior probabilities, given the trait data, that the genes
now exhibiting effects in current individuals entered the
pedigree through certain founder individuals, and/or
descended via certain ancestral paths. The second purpose
of probability computation is where the genetic model is
not known, and the probability of trait data under some
model is required as the likelihood of that model, to aid
in likelihood inference of the true genetic model underlying
the trait. Examples of each of these two classes of
problem will be given in section 6.

2  Genetic  models 

The elements of genetic models are straightforward;
genes exist, genes segregate (are copied) from parents
to offspring, and the types of genes carried by an individual
influence observable trait characteristics.

a) Genes exist; For the simplest genetically determined
traits, each individual carries two genes, which
may be of any of a number of types. The simplest
possibility is that there are two types, say A and B, and
in this case an individual has three possible combinations
of types of genes, or genotypes, AA, AB or BB.
The frequencies, or prior probabilities, for the types of
genes are parameters of the model. Genotypes for an
individual $i$ will be denoted $G_i$. Individuals whose parents
are not pedigree members, founders, have genotype
probabilities $P_\theta(G_i)$, under a genetic model indexed by
parameters $\theta$.

b)  Genes  segregate;  in  modern  terminology, 
Mendel’s  first  law  (1866)  states  that  one  of  the  two 
genes  that  an  individual  carries  for  a  trait  is  a  copy 
of  a  random  (equiprobably  chosen)  one  of  the  two  in 
his  father,  the  other  a  copy  of  a  random  one  of  the 
two  in  his  mother,  and  that  a  random  one  of  the  two 
he  carries  will  be  copied  to  each  child,  independently 
for each child. For any individual i with parents $M_i$
and $F_i$, Mendel's law provides the segregation probabilities
$P(G_i \mid G_{M_i}, G_{F_i})$. For example,
$P(G_i = AA \mid G_{M_i} = AA, G_{F_i} = AB) = 1/2$. Thus, segregation of the genes
for simple Mendelian traits does not provide any additional
unknown parameters of the model. However, as
described  below,  one  of  the  most  important  parameters 
of  modern  genetic  analyses  is  a  segregation  parameter. 

c) Genes influence traits; For the purposes of this
paper, we assume such influences can be summarised by
probabilities $P_\theta(y_i \mid G_i)$ where $y_i$ is the observed (qualitative
or quantitative) trait value of individual i. Such
probabilities are known as penetrance probabilities, and
their specification may involve unknown parameters.

We shall subsume all the parameters of the genetic
model into the parameter vector $\theta$, and use $P_\theta(\cdot)$ to denote
probabilities under the model. The total set of
genotypes  on  a  pedigree  will  be  denoted  G,  and  of  ob¬ 
served  phenotypes  y. 
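
The segregation probabilities of item (b) can be tabulated directly from Mendel's first law. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes a single locus with the two gene types A and B.

```python
GENOTYPES = ("AA", "AB", "BB")

def prob_transmit_A(parent):
    """Probability that a parent passes on an A gene (each gene equiprobable)."""
    return parent.count("A") / 2

def segregation(child, mother, father):
    """P(G_i = child | G_M = mother, G_F = father) under Mendel's first law."""
    a_m, a_f = prob_transmit_A(mother), prob_transmit_A(father)
    table = {
        "AA": a_m * a_f,
        "AB": a_m * (1 - a_f) + (1 - a_m) * a_f,
        "BB": (1 - a_m) * (1 - a_f),
    }
    return table[child]

print(segregation("AA", "AA", "AB"))  # 0.5, the example given in (b)
```

For every pair of parental genotypes the three values sum to one, so no parameter beyond the law itself is needed.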

Many  genetic  analyses  are  concerned  with  linkage 
analysis,  the  objective  being  inference  of  the  location 
within  the  genome  of  genes  controlling  a  trait  of  interest, 
by  observing  cosegregation  with  DNA  markers  whose 
position  in  the  genome  is  known.  The  recombination 
frequency  between  trait  genes  and  genes  determining  a 
marker  trait  is  the  frequency  with  which  genes  for  the 
two  traits  passed  on  by  an  individual,  i,  to  an  offspring 
derive  from  the  two  different  parents  of  i.  If  the  genes 
determining  the  two  traits  are  located  close  together  in 
the  genome,  this  frequency  is  close  to  0,  while  if  they  are 
far  apart,  or  on  different  chromosomes,  the  frequency  is 
1/2. Thus the recombination frequency, or linkage parameter,
r, ranges from 0 to 1/2, and determines the
segregation probability $P_r(G_i \mid G_{M_i}, G_{F_i})$, where now the
genotypes refer to the combined genotypes for both trait
and marker. Estimation of r, or testing of the hypothesis
r = 1/2, is a frequent objective of genetic analyses; an
example will be given in section 6.

3  Exact  likelihood  computation 

The  probability  of  data  observed  on  the  pedigree,  com¬ 
puted  under  a  specified  genetic  model,  or  the  likelihood 
of  that  model,  can  be  written  as 

$$ L(\theta) = P_\theta(y) = \sum_G P_\theta(y \mid G)\, P_\theta(G) \tag{3.1} $$

For  simple  genetic  models 

$$ P_\theta(y \mid G) = \prod_i P_\theta(y_i \mid G_i) \tag{3.2} $$

where the penetrance probability is interpreted as unity
for individuals i for whom no trait data are observed,
and

$$ P_\theta(G) = \prod_i P_\theta(G_i \mid G_{M_i}, G_{F_i}) \tag{3.3} $$

where $M_i$ and $F_i$ are the parents of i, and for founders
i the segregation probability is to be interpreted as
the population probability $P_\theta(G_i)$. Thus neither term
within the sum (3.1) is difficult to compute. The difficulty
lies only in the summation over all genotypic configurations
G on the pedigree.
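
To make the structure of (3.1) to (3.3) concrete, the sum can be evaluated by brute force on a very small pedigree. The sketch below is an illustration under stated assumptions, not the method of the paper: the trio pedigree, the function names, Hardy-Weinberg founder probabilities with B-gene frequency q, and the fully penetrant recessive trait are all hypothetical choices.

```python
from itertools import product

GENOTYPES = ("AA", "AB", "BB")
# Hypothetical trio: individuals 1 and 2 are founders, 3 is their child.
PARENTS = {1: None, 2: None, 3: (1, 2)}

def founder_prob(g, q):
    """Assumed Hardy-Weinberg founder probabilities; q = frequency of gene B."""
    return {"AA": (1 - q) ** 2, "AB": 2 * q * (1 - q), "BB": q * q}[g]

def segregation(child, mother, father):
    """Mendelian segregation probability P(G_i | G_M, G_F)."""
    pm, pf = mother.count("B") / 2, father.count("B") / 2
    return {"BB": pm * pf, "AA": (1 - pm) * (1 - pf),
            "AB": pm * (1 - pf) + (1 - pm) * pf}[child]

def penetrance(y, g, f):
    """P(y_i | G_i); y is None (unobserved, factor 1), 1 (affected) or 0."""
    if y is None:
        return 1.0
    return f[g] if y == 1 else 1.0 - f[g]

def likelihood(y, q, f):
    """Equation (3.1): sum over every genotypic configuration G."""
    ids = sorted(PARENTS)
    total = 0.0
    for config in product(GENOTYPES, repeat=len(ids)):
        G = dict(zip(ids, config))
        term = 1.0
        for i in ids:
            if PARENTS[i] is None:
                term *= founder_prob(G[i], q)           # founder term of (3.3)
            else:
                m, fa = PARENTS[i]
                term *= segregation(G[i], G[m], G[fa])  # segregation term of (3.3)
            term *= penetrance(y[i], G[i], f)           # penetrance term of (3.2)
        total += term
    return total

recessive = {"AA": 0.0, "AB": 0.0, "BB": 1.0}
data = {1: None, 2: None, 3: 1}                  # only the child is observed
print(round(likelihood(data, 0.1, recessive), 6))  # 0.01, i.e. q squared
```

With three genotypes and three individuals there are only 27 terms; the same loop on a pedigree of n individuals has 3^n terms, which is the difficulty referred to above.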

The  most  successful  algorithms  previously  developed 
for  evaluating  (3.1)  were  first  described  by  Hilden 
(1970),  Elston  and  Stewart  (1971),  and  Heuch  and  Li 
(1972).  The  approach  was  generalised  by  Cannings, 
Thompson  and  Skolnick  (1978).  In  quite  other  contexts, 
similar  approaches  have  more  recently  been  developed 
by  Lauritzen  and  Spiegelhalter  (1988).  The  method 
rests  on  the  conditional  independence  structure  of  ge¬ 
netic  models:  given  the  types  of  the  genes  carried  by 
the  spouses,  parents  and  offspring  of  an  individual,  the 
genotype  and  data  observation  on  that  individual  are 
independent  of  all  other  data  and  genotypes  in  the  pedi¬ 
gree.  This  conditional  independence,  guaranteed  by  the 
model  specifications  of  section  2,  is  fundamental  also  to 
the  Gibbs  sampling  approach  of  section  5. 

Another  expression  of  the  same  fact  is  that,  con¬ 
ditional  on  the  types  of  the  genes  carried  by  indi¬ 
viduals  who  constitute  a  cutset,  dividing  the  pedigree 
into  two  or  more  disjoint  components,  the  trait  data 
observed on each component, and on the cutset,
are  jointly  independent.  In  our  small  example  pedi¬ 
gree  (figure  1),  {31,25,23}  is  a  cutset;  conditional  on 
the  cutset  genotypes,  data  observed  on  the  three  sets, 
{22,24,26,27,28,29,30},  {31,25,23}  and  the  remain¬ 
der,  are  independent.  The  pair  {13,21}  is  also  a  cutset, 
dividing  {3,4,14,15,19,20}  from  the  remainder  of  the 
pedigree.  A  single  individual  (e.g.  12)  can  be  a  cutset. 

The  conditional  independence  exhibits  itself  in 
(3.1)  through  the  fact  that  terms  involving  mem¬ 
bers  of  one  component  of  a  pedigree,  for  example 
{3,4,14,15,19,20}  involve  additionally  only  the  rele¬ 
vant  cutset  members — the  spouses,  parents  or  offspring 
of  some  member  of  the  component  (in  this  example 
{13,21}).  Thus  the  summation  (3.1)  can  be  accom¬ 
plished  sequentially  through  the  pedigree,  considering 
only  a  few  individuals  at  each  stage  and  producing  at 
each stage a real-valued function defined on the possible
genotypic configurations of a cutset. These functions
were called R-functions by Cannings, Thompson and
Skolnick (1978). Specifically, summation of all terms involving
individuals 3 and 4 can be accomplished for each
genotypic configuration of the cutset {13,14}. Then this
R-function can be incorporated into summation over the
possibilities for individuals 14 and 15, producing a new
R-function on {13,19}, and thirdly summation over 19
and 20 produces an R-function on {13,21}. In this way


it  is  possible  to  work  through  an  entire  complex  pedi¬ 
gree,  accomplishing  finally  the  summation  (3.1). 

This  method  has  proved  successful  in  analysing  traits 
on  a  number  of  large  and  complex  pedigrees,  but  it  is 
severely  bounded  in  a  way  that  seems  unlikely  to  be 
much  relieved  by  increased  computing  capacity.  For  the 
simplest  genetic  models,  an  individual  has  three  possi¬ 
ble  genotypes;  for  the  simplest  linkage  model  there  are 
ten. Where there are k possible genotypes, for a cutset
of n individuals, there are $k^n$ possible genotypic configurations;
the R-function has $k^n$ discrete values. Nor
can the problem be resolved by including more individuals
in each sequential summation; if N cutset and
non-cutset individuals are involved in a given step, there
are (at least in principle) $k^N$ terms to be considered in
the summation. Since 1978 the increase in computing
power has enabled us to extend from the initial programs
with cutsets of size 8 ($3^8 = 6561$) to cutsets of
size 14 (or 13 with double precision; $3^{13} = 1,594,323$).
But  large  complex  pedigrees  cannot  always  be  resolved 
with  cutsets  of  size  no  more  than  13,  and  for  a  genetic 
model  with  ten  genotypes  even  cutsets  of  size  8  remain 
impossible. 
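
The growth in storage can be checked with a line of arithmetic. The helper below is a hypothetical name used only for illustration; it simply evaluates k^n for the cases quoted above.

```python
def r_function_size(k, n):
    """Number of values of an R-function: k genotypes, cutset of n individuals."""
    return k ** n

print(r_function_size(3, 8))    # 6561
print(r_function_size(3, 13))   # 1594323
print(r_function_size(10, 8))   # 100000000
```

The last figure shows why a ten-genotype linkage model with a cutset of size 8 needs roughly sixty times more storage than the largest three-genotype case quoted.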

For more details of the peeling approach, Cannings,
Thompson  and  Skolnick  (1978)  give  the  theory,  and  ex¬ 
amples  are  discussed  by  Thompson  (1986).  We  have 
given  here  only  a  sufficient  description  to  demonstrate 
the  need  for  other  approaches  and  to  provide  a  basis 
for  the  discussion  of  section  7.  One  later  requirement 
will be the probabilistic interpretation of an R-function.
Where  the  component  for  which  summation  has  been 
accomplished  (the  peeled  set,  Q)  contains  no  parents 
of  cutset  individuals,  the  interpretation  is  straightfor¬ 
ward;  each  term  is  simply  the  probability  of  data  ob¬ 
served  on  Q  conditional  on  that  particular  genotypic 
configuration  on  the  cutset  C.  For  example,  in  figure 
1, with Q = {22,24,26,27,28,29,30}, each term of
$R(G_{31}, G_{25}, G_{23})$ is the probability of data observed on
Q given the particular configuration of genotypes on
C = {31,25,23}. Where parents of $i \in C$ are in Q, the
probability is joint with the relevant genotype of i, and
it is further important to recognise that the R-function
incorporates only probabilities resulting from genealogical
relationships within Q. Thus

$$ R(G_{13}, G_{21}) = \mathrm{prob}^*(\text{data on } \{3,4,14,15,19,20\},\ G_{13} \mid G_{21}) \tag{3.4} $$

where prob* denotes the fact that this probability is
computed only on the subpedigree consisting of Q and
C: that is, it incorporates that 13 is the great aunt of
21's offspring 20, but not that she is also the spouse of
21's father, 12, through whom there is also dependence
by virtue of any data on their joint descendants 16 and
18. Note also that if one parent of a cutset individual is
in Q the other must either be in Q or else in C.


4  A  Monte  Carlo  approach 

An alternative approach to the summation (3.1) is a
Monte Carlo one. Simulation from $P_\theta(G)$ is straightforward,
by assigning genes to the founders of the pedigree,
with appropriate probabilities, and simulating the
Mendelian segregation of those genes. A set of M realisations
$\{G^{(i)} ; i = 1, \ldots, M\}$ provides a simple Monte
Carlo estimate

$$ \frac{1}{M} \sum_{i=1}^{M} P_\theta(y \mid G^{(i)}). \tag{4.1} $$

However,  this  estimate  is  useless  on  a  large  or  complex 
pedigree,  particularly  where  data  are  confined  to  the 
lower  part  of  the  pedigree.  There  are  huge  numbers  of 
genotypic  configurations  on  a  pedigree.  Normally  only 
a  minute  proportion  are  even  compatible  with  the  data 
(give a non-zero value of (3.2)), and even these are likely
to  give  negligible  contribution  to  the  likelihood  (3.1). 
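
A minimal sketch of the simple estimate (4.1) shows both how it works and why it wastes effort. Everything specific here is an assumption made for illustration: a trio pedigree, founder genes drawn independently with B-gene frequency q, and a fully penetrant recessive trait observed only on the child. The function names are hypothetical.

```python
import random

def draw_founder(rng, q):
    """Two genes drawn independently; q is the assumed frequency of gene B."""
    return tuple("B" if rng.random() < q else "A" for _ in range(2))

def drop_genes(rng, mother, father):
    """Mendelian segregation: one randomly chosen gene from each parent."""
    return (rng.choice(mother), rng.choice(father))

def penetrance_affected(g):
    """Assumed fully penetrant recessive trait: affected only if both genes are B."""
    return 1.0 if g == ("B", "B") else 0.0

def naive_estimate(q, M, seed=0):
    """Equation (4.1): average of P(y | G) over realisations G drawn from P(G)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(M):
        mother, father = draw_founder(rng, q), draw_founder(rng, q)
        child = drop_genes(rng, mother, father)
        total += penetrance_affected(child)   # P(y | G) for this realisation
    return total / M

# Fluctuates around the exact value q**2 = 0.01 for this toy model.
print(naive_estimate(q=0.1, M=100_000))
```

Even in this three-person example about 99 realisations in 100 are incompatible with the data and contribute nothing; on a large pedigree the compatible fraction becomes vanishingly small.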

Another  proposal  was  made  by  K.  Lange  in  Ott 
(1979);  namely,  to  rewrite  (3.1)  as 

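$$ L(\theta) = \sum_G P_\theta(y \mid G)\, \frac{P_\theta(G)}{P_{\theta_0}(G)}\, P_{\theta_0}(G) \tag{4.2} $$

and to simulate $\{G^{(i)} : i = 1, \ldots, M\}$ from $P_{\theta_0}(G)$,
giving a Monte Carlo estimate

$$ \frac{1}{M} \sum_{i=1}^{M} P_\theta(y \mid G^{(i)})\, \frac{P_\theta(G^{(i)})}{P_{\theta_0}(G^{(i)})}. \tag{4.3} $$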


This is likewise not effective on a large pedigree, for
$P_{\theta_0}(G)$ may generate realisations even less related to
$P_\theta(y \mid G)$ than does $P_\theta(G)$, but it contains the seeds of
two key ideas. The first is that of sampling according to
some other distribution and reweighting the summand
accordingly, that is, of importance sampling. The second
is that by simulation at a single $\theta_0$, Monte Carlo
estimates of an entire function $L(\theta)$ are, in principle,
obtainable.

Pursuing the idea of importance sampling, what
would be the optimal simulation distribution? Since
$P_\theta(y \mid G) P_\theta(G)$ is, as a function of G, proportional
to $P_\theta(G \mid y)$, we require something of similar form to
$P_\theta(G \mid y)$, for example $P_{\theta_0}(G \mid y)$ for some $\theta_0$ similar to
$\theta$. Unfortunately, $P_{\theta_0}(G \mid y)$ cannot be written down explicitly;
evaluation is equivalent to that of (3.1). And
how do we sample from this distribution? Deferring the


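latter question, consider first the form that the estimate
would take. The likelihood (3.1) or (4.2) is also

$$ L(\theta) = \sum_G \frac{P_\theta(y \mid G)}{P_{\theta_0}(y \mid G)}\, \frac{P_\theta(G)}{P_{\theta_0}(G)}\, P_{\theta_0}(y \mid G)\, P_{\theta_0}(G) \tag{4.4} $$

and thence, by Bayes theorem,

$$ L(\theta) = \sum_G \frac{P_\theta(y \mid G)}{P_{\theta_0}(y \mid G)}\, \frac{P_\theta(G)}{P_{\theta_0}(G)}\, P_{\theta_0}(y)\, P_{\theta_0}(G \mid y). \tag{4.5} $$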


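Now $P_{\theta_0}(y)$ is also unknown; again an evaluation equivalent
to (3.1) is required. However, by definition, this
probability is the likelihood $L(\theta_0)$. In likelihood inference,
likelihood ratios are sufficient. Thus rewriting
(4.5) as

$$ \frac{L(\theta)}{L(\theta_0)} = \sum_G \frac{P_\theta(y \mid G)}{P_{\theta_0}(y \mid G)}\, \frac{P_\theta(G)}{P_{\theta_0}(G)}\, P_{\theta_0}(G \mid y) \tag{4.6} $$

we can obtain a Monte Carlo estimate of the likelihood
ratio $L(\theta)/L(\theta_0)$,

$$ \frac{1}{M} \sum_{i=1}^{M} \frac{P_\theta(y \mid G^{(i)})}{P_{\theta_0}(y \mid G^{(i)})}\, \frac{P_\theta(G^{(i)})}{P_{\theta_0}(G^{(i)})} \tag{4.7} $$

where now the $G^{(i)}$ are realisations from $P_{\theta_0}(G \mid y)$.
Moreover, from a single set of realisations at $\theta_0$ we can
obtain estimates of $L(\theta)/L(\theta_0)$ for many different values
of $\theta$.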

Note also that if the $(\theta : \theta_0)$ difference lies only in the
segregation probabilities, it is not even necessary to be
able to compute $P_\theta(y \mid G)$. If $P_\theta(y \mid G^{(i)}) = P_{\theta_0}(y \mid G^{(i)})$,
(4.7) reduces to

$$ \frac{1}{M} \sum_{i=1}^{M} \frac{P_\theta(G^{(i)})}{P_{\theta_0}(G^{(i)})} \tag{4.8} $$

where the data y now enter only through the sampling
of the $G^{(i)}$ from $P_{\theta_0}(G \mid y)$. Thus, in particular, linkage
analysis for complex traits is possible (Guo and Thompson,
1991b); an example is given in section 6.
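
The estimator (4.8) can be tried on a toy problem in which the answer is known. The sketch below rests on several assumptions that are not in the text: a trio pedigree, Hardy-Weinberg founder probabilities with B-gene frequency q playing the role of the parameter, a fully penetrant recessive trait seen only on the child, and exact sampling from the posterior by enumeration as a stand-in for the Gibbs sampler of the next section. The penetrances do not depend on q, so the reduced form (4.8) applies.

```python
import random
from itertools import product

GENOTYPES = ("AA", "AB", "BB")

def hw(g, q):
    """Assumed Hardy-Weinberg founder probabilities; q = frequency of gene B."""
    return {"AA": (1 - q) ** 2, "AB": 2 * q * (1 - q), "BB": q * q}[g]

def seg(child, mother, father):
    """Mendelian segregation probability."""
    pm, pf = mother.count("B") / 2, father.count("B") / 2
    return {"BB": pm * pf, "AA": (1 - pm) * (1 - pf),
            "AB": pm * (1 - pf) + (1 - pm) * pf}[child]

def prior(G, q):
    """P_q(G) for a configuration G = (mother, father, child), as in (3.3)."""
    m, f, c = G
    return hw(m, q) * hw(f, q) * seg(c, m, f)

def penet(G):
    """P(y | G): child affected, recessive trait, parents unobserved."""
    return 1.0 if G[2] == "BB" else 0.0

def posterior_sample(q0, M, seed=0):
    """Draw G from P_q0(G | y) exactly; feasible only on a tiny pedigree."""
    configs = list(product(GENOTYPES, repeat=3))
    weights = [penet(G) * prior(G, q0) for G in configs]
    return random.Random(seed).choices(configs, weights=weights, k=M)

def ratio_estimate(q, q0, sample):
    """Equation (4.8): average of P_q(G) / P_q0(G) over posterior realisations."""
    return sum(prior(G, q) / prior(G, q0) for G in sample) / len(sample)

sample = posterior_sample(q0=0.1, M=20_000)
for q in (0.05, 0.1, 0.15):
    # The exact ratio for this toy model is (q / 0.1) ** 2.
    print(q, ratio_estimate(q, 0.1, sample))
```

A single set of realisations at q0 serves for every q in the loop, which is the point made above about obtaining a whole likelihood curve from one run.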

The  development  of  this  section  presupposes  the  avail¬ 
ability  of  realisations  from  the  global  posterior  condi¬ 
tional  distribution  of  genotypic  configurations  on  the 
pedigree,  conditional  on  the  observed  data.  The  next 
section  will  complete  the  picture  by  describing  how  the 
pedigree Gibbs sampler provides such realisations. In obtaining
such realisations we also solve the problem of estimation
of risk probabilities on pedigrees, for such probabilities
are precisely specified marginals of this conditional
distribution. Estimates are thus provided by relative
frequency counts in the realisations. An example
is given in section 6.




5  The  pedigree  Gibbs  sampler 

Thus the remaining task is to obtain the realisations
from $P_{\theta_0}(G \mid y)$ required for (4.7) above. This can be
achieved  by  a  Gibbs  sampler  (Hastings,  1970)  on  the 
genotypes  of  the  individuals  of  the  pedigree.  The  Gibbs 
sampler  has  recently  become  widely  used  in  image  anal¬ 
ysis  (Geman  and  Geman,  1984),  a  situation  in  which 
the  global  conditional  distribution  of  true  image  con¬ 
ditional  on  data  observations  cannot  be  computed  nor 
directly  simulated  from,  but  in  which  the  local  condi¬ 
tional  distributions  are  easily  specified  and  easy  to  sim¬ 
ulate  from.  This  is  the  situation  in  pedigree  analysis. 
Although $P_{\theta_0}(G \mid y)$ cannot be evaluated, the local
conditional distributions satisfy

$$ P_{\theta_0}(G_i \mid y, G_{-i}) = P_{\theta_0}(G_i \mid y_i, G_{N_i}) \tag{5.1} $$

where $G_{-i}$ denotes the genotypes of all individuals other
than i, and $G_{N_i}$ the genotypes of the neighbours of
i, which, from section 3, comprise his parents, spouses
and offspring. Moreover, the probability (5.1) is proportional,
as a function of $G_i$, to the product of the penetrance
probability $P_{\theta_0}(y_i \mid G_i)$ and the segregation (or founder)
probabilities for triplets $(G_j, G_{M_j}, G_{F_j})$ for i = j, and for
i = $M_j$ or $F_j$.

Thus  one  possible  implementation  of  the  Gibbs  sam¬ 
pler  is  as  follows: 

Start from a genotypic configuration on the pedigree
for which $P_{\theta_0}(G \mid y) > 0$.

Take  a  random  permutation  of  the  individuals  in  the 
pedigree.  For  each  individual,  according  to  this  per¬ 
mutation,  update  Gi  in  the  current  configuration  G  by 
sampling  from  the  local  conditional  distribution  (5.1). 
We  refer  to  this  procedure  as  one  random  scan  of  the 
pedigree. 

We now perform repeated random scans, taking a new
permutation of the pedigree members each time. This
process defines a Markov chain on the space of genotypic
configurations for which $P_{\theta_0}(G \mid y) > 0$, and $P_{\theta_0}(G \mid y)$ is
an equilibrium distribution of this Markov chain. Provided
the Markov chain is irreducible, the configuration
after successive scans converges in distribution to the
required global conditional distribution, and dependent
realisations $G^{(i)}$ can be obtained by sampling the chain
after a sufficient initial period for the convergence in
distribution to be approximately accomplished.
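
The random-scan procedure just described can be sketched in a few lines of Python. This is a toy illustration under assumptions that are not part of the paper: a trio pedigree, Hardy-Weinberg founder probabilities with B-gene frequency q, a fully penetrant recessive trait observed only on the child, and hypothetical function names.

```python
import random

GENOTYPES = ("AA", "AB", "BB")
PARENTS = {1: None, 2: None, 3: (1, 2)}   # hypothetical trio; 3 is the child
Y = {1: None, 2: None, 3: 1}              # only the child is observed (affected)
F = {"AA": 0.0, "AB": 0.0, "BB": 1.0}     # assumed recessive penetrances

def founder_prob(g, q):
    return {"AA": (1 - q) ** 2, "AB": 2 * q * (1 - q), "BB": q * q}[g]

def segregation(child, mother, father):
    pm, pf = mother.count("B") / 2, father.count("B") / 2
    return {"BB": pm * pf, "AA": (1 - pm) * (1 - pf),
            "AB": pm * (1 - pf) + (1 - pm) * pf}[child]

def penetrance(y, g):
    if y is None:
        return 1.0                         # no data: factor of unity
    return F[g] if y == 1 else 1.0 - F[g]

def local_weight(i, G, q):
    """Unnormalised (5.1): penetrance of i times every founder or
    segregation term in which individual i appears."""
    w = penetrance(Y[i], G[i])
    for j, par in PARENTS.items():
        if par is None:
            if j == i:
                w *= founder_prob(G[j], q)
        elif i == j or i in par:
            w *= segregation(G[j], G[par[0]], G[par[1]])
    return w

def random_scan(G, q, rng):
    """Update every individual once, in a fresh random order."""
    order = list(PARENTS)
    rng.shuffle(order)
    for i in order:
        weights = []
        for g in GENOTYPES:
            G[i] = g
            weights.append(local_weight(i, G, q))
        G[i] = rng.choices(GENOTYPES, weights=weights)[0]

def gibbs(q, n_scans, burn_in, seed=0):
    rng = random.Random(seed)
    G = {1: "AB", 2: "AB", 3: "BB"}        # feasible start: P(G | y) > 0
    sample = []
    for s in range(n_scans):
        random_scan(G, q, rng)
        if s >= burn_in:
            sample.append(dict(G))
    return sample

sample = gibbs(q=0.1, n_scans=5000, burn_in=400)
# Relative frequency estimate of P(mother is BB | data); exact value is q = 0.1.
print(sum(G[1] == "BB" for G in sample) / len(sample))
```

The final line is a relative frequency count of the kind used for risk probabilities; on a pedigree this small the chain mixes immediately, which is not representative of the large pedigrees for which the method is intended.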

There are many details of the above procedure not
fully described here. First, a feasible initial configuration
must be found; this is usually not hard. Second, the
chain must be irreducible; in general this is a problem,
but irreducibility does obtain for the examples of this
paper (among many others). Third, decisions must be
made  as  to  the  sampling  of  the  chain;  here  our  guidelines 
are  very  preliminary.  In  cases  where  the  convergence  is 


likely  to  be  slow,  such  as  the  Hutterite  example  below, 
we  have  used  an  initial  period  of  4000  scans.  Where  the 
convergence  in  distribution  of  the  chain  realisations  can 
be  shown  to  be  faster,  as  in  the  linkage  example  below, 
an  initial  period  of  400  scans  suffices. 

Thereafter  the  chain  should  be  sampled  at  a  frequency 
that  depends  on  the  trade-off  between  the  autocorrela¬ 
tion  between  successive  scans,  and  the  amount  of  com¬ 
putation  to  be  performed  with  each  sampled  realisation 
(Geyer,  this  volume).  For  the  Hutterite  example  be¬ 
low,  we  sample  every  scan,  since  we  are  merely  counting 
aspects  of  each  realisation.  Nonetheless,  long  runs  will 
still  be  needed,  to  ensure  the  space  is  well  sampled;  high 
autocorrelation  on  a  large  pedigree  means  that  many 
scans  are  needed  to  traverse  the  space.  For  the  linkage 
example,  where  more  computation  is  needed  from  each 
sample,  we  sample  only  every  20  scans.  For  this  par¬ 
ticular  example,  more  frequent  sampling  might  well  be 
justified,  but  for  likelihood  analysis  of  complex  genetic 
models it has been found that such an interval between
samples  may  be  necessary  (Guo  and  Thompson,  1991a). 

6  Two  examples 

In  this  section,  two  examples  are  given.  The  first  is  of 
the  performance  of  the  Gibbs  sampler  itself,  through  its 
use in providing posterior probabilities on a 583-member
section  of  the  Hutterite  genealogy.  The  second  example 
shows  the  use  of  Gibbs  sampler  realisations  in  providing 
Monte  Carlo  estimates  of  likelihood  curves  for  a  genetic 
linkage  model.  Each  example  is  only  a  preliminary  so¬ 
lution  to  the  problem  it  addresses. 

The  Hutterite  population  is  a  North  American  reli¬ 
gious  and  genetic  isolate,  now  numbering  over  25,000 
but  descended  from  only  about  77  founders  some  10 
generations  ago.  Cystic  fibrosis,  a  simple  recessive  ge¬ 
netic  disease,  has  a  high  frequency  in  this  population, 
and  the  ancestry  of  the  haplotypes  carrying  the  cystic 
fibrosis  (CF)  gene  is  of  some  interest  (Fujiwara  et  al., 
1989).  The  pedigree  of  11  current  cystic  fibrosis  cases  is 
shown in figure 2 (see also Fujiwara et al., 1989); these
11  individuals  trace  to  62  founders.  This  pedigree  of 
583  individuals  is  far  too  complex  to  peel,  but  is  well 
suited  to  Gibbs  sampling.  The  data  consist  of  the  11 
individuals  known  to  carry  two  copies  of  the  CF  gene; 
additionally,  since  cystic  fibrosis  is  lethal,  no  ancestor 
can  carry  two  copies  of  the  CF  gene.  Here  we  consider 
only  estimation  of  the  marginal  probability  that  each 
of  the  62  founders  carries  the  gene.  This  is  achieved 
by  running  the  Gibbs  sampler,  starting  from  an  initial 
configuration  in  which  all  ancestors  carry  one  CF  gene, 
and enumerating, after every scan, the founders who
carry the gene. The excess of carriers is very quickly
eliminated from the configuration, particularly when a
founder allele frequency of 0.025 (a typical European
value) is assumed for the CF gene.

Figure 2: The Hutterite Cystic Fibrosis pedigree.

The  results  displayed  here  are  preliminary,  and  illus¬ 
trative  of  the  method  only.  We  have  considered  only 
the  ancestors  of  CF  cases,  and  not  the  information  also 
provided  by  many  unaffected  lateral  relatives.  Second, 
these  11  cases  are  not  the  only  identified  CF  cases  in 
the  Hutterite  population.  Third,  we  have  not  made  use 
of information (Fujiwara et al., 1989) on closely linked
DNA markers. Additionally, we have not made use of
the symmetry between members of a founder couple,
preferring to use this as a partial check on the results
obtained, rather than as a constraint.

Four runs each of 50,000 sampled random scans were
obtained, and the number of times each
of the 62 founders was exhibited as a CF carrier was
tabulated. Figure 3a shows a plot of the counts for
one of the runs against the total for the other three,
showing broad agreement, but also considerable variation.
In the 3-run totals couples were generally in good
agreement. The 6 extreme points constitute 3 founder
couples; only one of these shows substantial discrepancy
between the estimates for the two members, even for the
single run of only 50,000 scans. Figure 3b shows the histogram
of counts for the 62 founders. The three couples
with CF-carrier probability estimates greater than 0.15
(30,000/200,000) stand as outliers. Although these results
are preliminary, this is a particularly encouraging
result; there are good population genetic reasons (Fujiwara
et al., 1989) for assuming that there must have
been at least 3 original CF genes in this population.

Figure 3: CF-carrier count Gibbs sampler results for 62
Hutterite founders. (a) CF-carrier count in 3 runs of 50,000 scans;
(b) CF-carrier counts in 200,000 scans for 62 founders.

Our  second  example  shows  the  estimation  of  link¬ 
age  likelihood  curves  from  single  runs  of  a  Gibbs  sam¬ 
pler,  as  described  in  section  4.  The  quantitative  data 
y were simulated on an extended pedigree of 230 individuals,
according to a complex genetic model (that
is, $P_{\theta_0}(y \mid G)$ is not as straightforward as described in
section 2, and in fact is not easily evaluated). However,
the feature primarily of interest is linkage with a marker
locus, also simulated, and estimation of the LOD score
$\log_{10}(L(r)/L(1/2))$ was achieved as in equation (4.8),
estimating $L(r)/L(r_0)$ and $L(1/2)/L(r_0)$, where $r_0$ is
the recombination frequency used in running the Gibbs
sampler. The simulation value of the recombination frequency,
r, is 0.1. The maximum likelihood estimate,
under the best fitting complex model within the class
considered, is 0.04. Two Gibbs sampler runs were therefore
made; one at the maximum likelihood estimates,
and the other with all other parameters at the maximum
likelihood estimates, but with r = 0.1. The two
curves are shown in figure 4.

Figure 4: LOD score curves for genetic linkage example
(horizontal axis: recombination frequency, r).

Again,  these  are  preliminary  results,  but  demonstrate 
the  feasibility  of  obtaining  an  entire  likelihood  curve 
from  a  single  run  of  the  Gibbs  sampler.  The  similar¬ 
ity  of  shape  of  the  two  curves  is  encouraging.  The  small 
discrepancy  in  height  is  to  be  expected,  since  the  curves 
are best estimated in the neighbourhood of $r_0$, and the
normalisation relative to r = 1/2 will be more uncertain.
In view of this, the small magnitude of the difference in
the  estimated  maximum  LOD  score  (the  goal  of  so  many 
applied  analyses!),  is  very  encouraging  also. 
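
The LOD score is the difference of the logarithms of two estimated ratios, both taken relative to the same run at r0. The sketch below shows only that arithmetic. It makes a simplifying assumption that is not stated in the text: each sampled realisation is summarised by the number k of recombinant meioses among n informative meioses, so that the ratio of segregation probabilities in (4.8) is (r/r0)^k ((1-r)/(1-r0))^(n-k). The function names and the counts are made up for illustration.

```python
import math

def ratio_estimate(r, r0, counts):
    """Estimate L(r)/L(r0) as in (4.8) from (recombinants, meioses) pairs."""
    total = 0.0
    for k, n in counts:
        total += (r / r0) ** k * ((1 - r) / (1 - r0)) ** (n - k)
    return total / len(counts)

def lod_score(r, r0, counts):
    """log10 of L(r)/L(1/2), formed from two ratios relative to r0."""
    return (math.log10(ratio_estimate(r, r0, counts))
            - math.log10(ratio_estimate(0.5, r0, counts)))

counts = [(1, 10), (2, 10), (0, 10)]   # made-up summaries of three realisations
for r in (0.05, 0.1, 0.2, 0.5):
    print(r, round(lod_score(r, r0=0.1, counts=counts), 3))
```

Because both ratios come from the same realisations, the curve is tied to zero at r = 1/2 by construction, and its estimation error is smallest for r near r0.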

7  Discussion 

There  is  clearly  considerable  scope  for  Monte  Carlo 
methods  using  the  Gibbs  sampler  to  address  many  ques¬ 
tions  of  statistical  genetics  arising  in  the  analysis  of 
large  and  complex  pedigrees.  Although,  where  peeling 
is  possible,  there  may  be  no  good  reason  for  resorting  to 
Monte  Carlo  estimates,  there  are  many  cases  where  due 
to  the  complexity  of  the  pedigree,  of  the  model,  or  of 
both, an exact result is unobtainable. In such cases the


Gibbs  sampler  can  often  be  quite  easily  implemented 
and  effectively  employed. 

There  is  considerable  scope  for  the  combination  of 
exact  computational  algorithms  with  Monte  Carlo  ap¬ 
proaches;  specifically,  to  combine  the  Gibbs  sampler 
and  peeling.  One  such  approach  has  been  developed 
by  Kong  (this  volume),  and  permits  multilocus  linkage 
analysis,  which  is  another  important  area  of  modern  ge¬ 
netic  analysis  in  which  exact  computations  have  proved 
intractable  or  impossible.  Peeling  and  Gibbs  sampling 
can  also  be  combined  to  provide  likelihoods  for  other 
complex  genetic  models.  Consider,  for  example,  the 
mixed  model  of  statistical  genetics,  in  which  there  are 
both  heritable  random  effects  (say  z)  and  the  effects 
of Mendelian genes (G). While it is possible to generalise
(4.7) and use a Gibbs sampler to obtain realisations
from $P_{\theta_0}(z, G \mid y)$, this would not be an effective
method of estimation (Thompson and Guo, 1991). In
fact, for random effects models on pedigrees $P_\theta(y \mid G)$
can be evaluated for any specified major-genotypic configuration
G by a rather different form of the peeling
algorithm.  That  is,  for  models  with  both  heritable  ran¬ 
dom  effects  and  major  genotypic  effects,  (4.7)  can  be 
used  to  estimate  likelihood  ratios  (Thompson  and  Guo, 
1991;  Guo  and  Thompson,  1991a). 

A third way of combining the Gibbs sampler and peeling
is by dividing the pedigree rather than the model.
On a large complex pedigree, it will often be the case
that some portions can be peeled, providing R-functions
on cutsets, each of several individuals who are members
of a core pedigree too complex to be peeled. As a simple
example, we might peel the right-hand segment of the
pedigree of figure 1, providing an R-function on individuals
13 and 21, but wish to use the Gibbs sampler on
the remainder.

We wish to combine the R-functions into the Gibbs
sampling to provide realisations on the core pedigree
that are from the posterior distribution of core genotypes
conditional on trait data on the entire pedigree.
In fact the implementation is straightforward, if we recall
the interpretation of the R-functions given by (3.4).
The cutset, C, consists of individuals whose parents are
not in the peeled set, say $C_u$ (e.g. 21), and others with
at least one parent peeled, say $C_l$ (e.g. 13). Let the
peeled set be Q and the remainder of the pedigree T.
For $i \in C$, let $K_i$ denote the offspring of i in T, and for j
in $K_i$ let $S_j$ denote the other parent of j. Normally, $S_j$ is
in T; relationships through i's spouses in Q are already
peeled. For individuals in T but not in C, the Gibbs
sampling is unaffected. For cutset individuals, the required
Gibbs sampling of $G_i$ should be conditional on
data $y_Q$ and $y_i$, and genotypes $G_{T-i}$, that is, genotypes in the set
T less the individual i. For i in $C_u$:

$$ \mathrm{prob}^*(y_Q, G_{C_l} \mid G_{C_u})\, P(y_i \mid G_i)\, P(G_i \mid G_{M_i}, G_{F_i}) \prod_{j \in K_i} P(G_j \mid G_i, G_{S_j}) $$
$$ = P(y_Q, G_{C_l}, G_i, y_i, G_{K_i} \mid G_{C_u - i}, G_{M_i}, G_{F_i}, \{G_{S_j}\}) $$
$$ \propto P(G_i \mid y_Q, G_{C_l}, y_i, G_{K_i}, G_{C_u - i}, G_{M_i}, G_{F_i}, \{G_{S_j}\}) $$
$$ = P(G_i \mid y_Q, y_i, G_{T-i}) $$

and for i in $C_l$ we have similarly, without the segregation
from i's parents,

$$ \mathrm{prob}^*(y_Q, G_{C_l} \mid G_{C_u})\, P(y_i \mid G_i) \prod_{j \in K_i,\ S_j \in T} P(G_j \mid G_i, G_{S_j}) $$
$$ = P(y_Q, G_{C_l - i}, G_i, y_i, G_{K_i} \mid G_{C_u}, \{G_{S_j}\}) $$
$$ \propto P(G_i \mid y_Q, G_{C_l - i}, y_i, G_{K_i}, G_{C_u}, \{G_{S_j}\}) $$
$$ = P(G_i \mid y_Q, y_i, G_{T-i}) $$

Thus  Gibbs  sampling  for  members  of  the  cutset  involves 
only  extracting  the  currently  appropriate  term  from  any 
R-function on any cutset of which the individual is a
member. This is not a final solution; questions of irreducibility
of the Markov chain on the core pedigree
arise, and would have to be resolved in any specific example,
just as they must in any case be resolved for any
genetic model. However, given such irreducibility, the
Gibbs sampler provides realisations for risk assessment
or likelihood evaluation, just as before.

Abstract

Peeling  and  Gibbs  sampling  are  two  computational  tools 
for  genetic  pedigree  analysis.  While  both  are  powerful 
methods,  each  has  its  limitations.  There  are  problems 
where  the  application  of  either  one  technique  alone  will 
not  lead  to  satisfactory  results.  For  some  of  these  prob¬ 
lems,  we  propose  methods  which  combine  peeling  and 
Gibbs  sampling.  The  key  idea  is  to  take  full  advantage 
of  the  strengths  of  each  method  and  eliminate  the  weak¬ 
nesses. 

Key  Words:  Pedigree  analysis,  Peeling,  Gibbs  sam¬ 
pling,  Monte  Carlo,  Markov  chain.  Likelihoods,  Lod 
score,  Bayesian  inference. 

1  Introduction 

Peeling is a standard computational tool geneticists use
for pedigree analysis (Elston and Stewart 1971, Lange
and Elston 1975, Cannings et al. 1978). While it is a powerful
method, there are also limitations. For example,
for  problems  which  involve  multiple  loci  or  complicated 
genetic  models  with  many  parameters,  simple  applica¬ 
tion  of  the  peeling  algorithm  can  be  infeasible  or  im¬ 
practical  due  to  the  limitations  of  memory  and  speed  of 
computations.  Another  technique,  the  Gibbs  sampler, 
which  had  been  used  extensively  in  statistical  physics 
and  image  reconstruction  (see  Geman  and  Geman  1984 
and Gelfand and Smith 1990), has recently found its
way into the genetics literature (Thompson and Wijsman
1990). The Gibbs sampler is an iterative technique
which allows us to draw multiple, but dependent, realizations
of the unobserved data (and sometimes parameters
in a Bayesian setting) conditioned on the observed
data. When applied to pedigree analysis, the drawn
samples can be used to get estimates of likelihood ratios
and, in a Bayesian setting, posterior distributions and
posterior odds. A potential weakness of the Gibbs sampler
is that the samples it generates can be so highly
correlated that the resulting Monte Carlo estimates
can be very far from the actual values without the user
noticing it. For a large class of problems which cannot
be handled very well by either peeling or Gibbs sampling
alone, we propose combining the two techniques to
achieve a satisfactory result.

Computations for this document were performed using computer
[...] DMS 84-04941 awarded to the Department of Statistics at The
University of Chicago, and by The University of Chicago Block
Fund.

In Section 2 we give a brief review of peeling and
highlight its limitations. Section 3 gives a brief description
of Gibbs sampling and also discusses the problem
of convergence. Sections 4 and 5 contain two examples
which illustrate how peeling and Gibbs sampling can be
combined. Section 6 has some final remarks.

2  Pedigree  Analysis  and  Peeling 

Pedigree analysis belongs to a class of problems in statistics
commonly known as missing data problems. Consider
a pedigree with n individuals. For i = 1, ..., n, let $g_i$
denote the genotype of person i and $y_i$ the observed
phenotype. Depending on the problem, both $g_i$ and $y_i$
may involve a single locus or multiple loci on the chromosomes.
The joint distribution of the $g_i$'s and $y_i$'s can
usually be written as

$$p_{\theta,\eta}(g, y) = p_\theta(g) \prod_{i=1}^{n} p_\eta(y_i \mid g_i) \qquad (1)$$

where $\theta$ is the recombination fraction(s) between loci
and $\eta$ is the parameter vector associated with the genetic
model relating $g_i$ and $y_i$. (Thompson 1986 is a good
reference for standard terminology.) When both g and
y are given, (1), interpreted as a function of $\theta$ and $\eta$,
is called the complete data likelihood. However, usually
only y is observed and g is referred to as the missing
data. The likelihood function based on the observed
data only can be written as

$$l(\theta, \eta) = p_{\theta,\eta}(y) = \sum_{g} \Big[ p_\theta(g) \prod_{i=1}^{n} p_\eta(y_i \mid g_i) \Big]. \qquad (2)$$

The sum is over all genotype vectors that are compatible
with the observed data. For any fixed values of
$\theta$ and $\eta$, each term in the sum is trivial to compute.
However, since in general there will be many genotype
vectors g which are compatible with the observed data,
summing by brute force is computationally infeasible except
for very small pedigrees. Let V = {1, ..., n} and
V' = {i | person i has no parents in the pedigree}, the
latter usually referred to as the set of founders. The
joint distribution of the $g_i$'s has the factorization

$$p_\theta(g) = \prod_{i \in V'} p(g_i) \prod_{i \in V - V'} p_\theta(g_i \mid g_{f_i}, g_{m_i}) \qquad (3)$$

where $f_i$ and $m_i$ denote the father and mother of person
i respectively. This factorization reflects the fact
that the genetic material of a person is inherited from
his/her parents. By taking advantage of (3), for pedigrees
without loops, peeling breaks down the global sum
(2), which involves the joint outcome space of all the
$g_i$'s, into a sequence of local sums, each one involving at
most the genotypes of three persons (two parents and
an offspring). The basic idea is to sum out (peel) one
person at a time. Peeling is related to the Kalman filter
and has applications other than genetics (Lauritzen and
Spiegelhalter 1988).
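
As a minimal sketch of the idea (not the implementation used in this paper), consider a nuclear family typed at one diallelic locus. The allele frequency Q, the penetrance values in RISK, and all function names are hypothetical choices made only for illustration. The brute-force function evaluates the sum (2) over every genotype vector, while the peeled version sums out each child separately for every pair of parental genotypes, so that each local sum involves only three persons.

```python
from itertools import product

Q = 0.3                      # hypothetical frequency of allele A
RISK = (0.05, 0.05, 0.90)    # hypothetical P(affected | 0, 1, 2 copies of A)

def founder_prob(g):
    """Hardy-Weinberg probability that a founder carries g copies of A."""
    return ((1 - Q) ** 2, 2 * Q * (1 - Q), Q ** 2)[g]

def child_prob(gc, gf, gm):
    """Mendelian probability of child genotype gc given parental genotypes."""
    a, b = gf / 2.0, gm / 2.0          # chance that each parent transmits A
    return ((1 - a) * (1 - b), a * (1 - b) + (1 - a) * b, a * b)[gc]

def penetrance(y, g):
    """P(phenotype y | genotype g); y is 1, 0, or None (unobserved)."""
    if y is None:
        return 1.0
    return RISK[g] if y == 1 else 1.0 - RISK[g]

def likelihood_brute_force(y_f, y_m, y_kids):
    """Sum over all 3^(2 + number of children) genotype vectors."""
    total = 0.0
    for g in product(range(3), repeat=2 + len(y_kids)):
        gf, gm, kids = g[0], g[1], g[2:]
        term = (founder_prob(gf) * penetrance(y_f, gf)
                * founder_prob(gm) * penetrance(y_m, gm))
        for gc, yc in zip(kids, y_kids):
            term *= child_prob(gc, gf, gm) * penetrance(yc, gc)
        total += term
    return total

def likelihood_peeled(y_f, y_m, y_kids):
    """Peel each child onto the parental pair, then sum over the parents."""
    total = 0.0
    for gf, gm in product(range(3), repeat=2):
        term = (founder_prob(gf) * penetrance(y_f, gf)
                * founder_prob(gm) * penetrance(y_m, gm))
        for yc in y_kids:
            # local sum over one child's genotype (three persons involved)
            term *= sum(child_prob(gc, gf, gm) * penetrance(yc, gc)
                        for gc in range(3))
        total += term
    return total
```

In this toy setting the brute-force cost grows as 3 to the power of the family size, whereas the peeled version grows only linearly in the number of children; both functions return the same likelihood.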

The method of peeling works very well for many problems,
but can face difficulties in the following situations:

A. PEDIGREES WITH MANY LOOPS. Pedigrees
with many loops due to inbreeding can be found in studies
of rare recessive diseases. Loops create problems because
they complicate the dependencies among relatives.
Peeling can still be used, but instead of local computations
involving three people at a time, sometimes computations
have to be done on four or more people simultaneously.
When there are too many inbreeding loops,
the memory and computational requirements can reach
a stage where peeling is either infeasible or, at the least,
impractical.

B. MULTIPLE LOCI. There are problems in genetics,
such as multi-point linkage analysis, where many
linked loci have to be handled simultaneously. Suppose
we are dealing with loci 1 to k which have respectively
$m_1, m_2, \ldots, m_k$ alleles. The number of states
the composite (involving all the loci) genotype of an individual
can have is approximately

(4) 

It  can  be  easily  seen  that  even  if  there  is  no  inbreeding  so 
that  we  only  have  to  handle  three  people  at  a  time,  the 
amount  of  computations  required,  which  is  proportional 
to  the  cube  of  (4),  will  quickly  exceed  our  capability  as 
the  number  of  loci  increases.  We  propose  in  Section  4  a 
method  to  handle  this  class  of  problems. 

C. MODELS WITH MANY PARAMETERS. Sometimes
the genetic models can be very complicated and
involve many parameters. Even if it is possible to peel
the pedigree for any given set of parameter values, which
sometimes is not the case, this only gives one point
of the likelihood function defined on a high dimensional
space. It may require many of these point by point evaluations
to get the maximum likelihood estimate
and other inference tools such as lod scores (basically
generalized likelihood ratios). This can become impractical,
if not infeasible, for problems with many parameters.
For problems of this type, peeling has to be coupled
with other techniques such as the EM algorithm (Lander
and Green 1987) or the Gibbs sampler. An example
of the latter is presented in Section 5.

3  Monte  Carlo  Approximations  of  Likelihood  Ratios  and  Gibbs  Sampling

Instead of evaluating likelihoods exactly using methods
such as peeling, for many problems it is often adequate
if multiple realizations of the unobserved genotypes
can be drawn jointly conditioned on the observed
data. For example, as suggested by Thompson and Wijsman
(1990), suppose we can simulate realizations of g
from the conditional distribution

$$p_{\theta_0,\eta_0}(g \mid y) \qquad (5)$$

where $\theta_0$ and $\eta_0$ are some chosen values of the parameters.
Let $g(t)$, t = 1, ..., T, be T simulated values of
g. For any other parameter vector $(\theta, \eta)$, the likelihood
ratio $l(\theta, \eta)/l(\theta_0, \eta_0) = p_{\theta,\eta}(y)/p_{\theta_0,\eta_0}(y)$
can be approximated by

$$\frac{1}{T} \sum_{t=1}^{T} \frac{p_{\theta,\eta}(g(t), y)}{p_{\theta_0,\eta_0}(g(t), y)}. \qquad (6)$$

This approximation is justified because of the identity

$$\frac{p_{\theta,\eta}(y)}{p_{\theta_0,\eta_0}(y)} = \sum_{g} \frac{p_{\theta,\eta}(g, y)}{p_{\theta_0,\eta_0}(g, y)} \, p_{\theta_0,\eta_0}(g \mid y). \qquad (7)$$

Expression (6) is essentially a Monte Carlo estimate
based on importance sampling. Note that (6) is easy to
compute because each of the complete data likelihoods
is supposed to be simple. For other situations where we
would like to simulate the missing genotypes conditioned
on the observed data, see Kong (1991) and Lange and
Sobel (1990). If the pedigree can be peeled, the peeling
algorithm can be modified for the simulations described
here (Ploughman and Boehnke 1989, Ott 1989 and Kong
1991). Now suppose the pedigree cannot be peeled because
of the reasons given in Section 2. Here is where Gibbs
sampling comes into play.
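
The estimate (6) can be illustrated on a deliberately small toy model; this is a sketch under stated assumptions, not a pedigree computation. Here the individuals are unrelated, each missing genotype is a binary carrier indicator with carrier frequency q (which plays the role of the parameter), and the penetrances in RISK are hypothetical. Because the individuals are independent, exact draws from the conditional distribution (5) are easy; in a real pedigree this step would need peeling-based simulation or Gibbs sampling.

```python
import random
from itertools import product

RISK = (0.1, 0.8)   # hypothetical P(y = 1 | g = 0) and P(y = 1 | g = 1)

def complete_lik(q, g, y):
    """Complete-data likelihood p_q(g, y) for independent individuals."""
    out = 1.0
    for gi, yi in zip(g, y):
        out *= q if gi == 1 else 1.0 - q
        out *= RISK[gi] if yi == 1 else 1.0 - RISK[gi]
    return out

def observed_lik(q, y):
    """Observed-data likelihood: brute-force sum over all g, as in (2)."""
    return sum(complete_lik(q, g, y)
               for g in product((0, 1), repeat=len(y)))

def draw_g_given_y(q0, y, rng):
    """One exact draw from p_{q0}(g | y), one individual at a time."""
    g = []
    for yi in y:
        w1 = complete_lik(q0, (1,), (yi,))
        w0 = complete_lik(q0, (0,), (yi,))
        g.append(1 if rng.random() < w1 / (w0 + w1) else 0)
    return tuple(g)

def mc_likelihood_ratio(q, q0, y, T, rng):
    """Monte Carlo estimate (6) of l(q) / l(q0) from T draws under q0."""
    total = 0.0
    for _ in range(T):
        g = draw_g_given_y(q0, y, rng)
        total += complete_lik(q, g, y) / complete_lik(q0, g, y)
    return total / T
```

Comparing `mc_likelihood_ratio(0.3, 0.2, y, T, rng)` with `observed_lik(0.3, y) / observed_lik(0.2, y)` for a short phenotype vector y shows the estimate approaching the exact ratio as T grows. The estimate degrades when q is far from q0, which is the usual limitation of importance sampling.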

Gibbs sampling is a very general technique for simulating
joint realizations of dependent variables. The idea
is very simple. Suppose there is a set of variables which we
want to simulate jointly, possibly conditioned on some
observed data. This set of variables is partitioned into a
number of components so that each component consists
of one or more variables. We start with some configuration
of all the variables which is compatible with the
observed data. Individual components are then visited
based on some systematic or random scheme. When a
component is visited, a realization of it is drawn conditioned
on the current configuration of all the other
components. The iterations set up a stationary Markov
chain whose equilibrium distribution is the same as the
joint distribution we want to simulate from. This implies
that, after many iterations, which means that each
component has been visited many times, the joint realization
of all the components can be considered as a
draw from the desired distribution (conditioned on the
observed data and given parameter values). A key point
to note is that the Gibbs sampler itself does not specify
exactly how the variables should be partitioned. The
latter is a decision that the user has to make. There are
two criteria for choosing an optimal partition:

(I)  Drawing  one  component  conditioned  on  the  others 
is  computationally  simple. 

(II)  The  Markov  chain  induced  by  the  Gibbs  sampler 
applied  to  this  partition  has  to  converge  reasonably  fast 
to  its  equilibrium  distribution. 

It  is  not  difficult  to  see  that  (I)  and  (II)  are  usually 
conflicting  criteria.  For  example,  not  partitioning  the 
variables  at  all  and  just  drawing  them  jointly  is  best 
under  (II),  but  it  is  in  general  impossible  to  implement 
and  is  the  reason  that  the  Gibbs  sampler  was  invented 
in  the  first  place.  In  general,  some  compromise  has  to 
be  made. 
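
The visiting scheme described above can be sketched on the smallest possible case: two binary variables, each forming its own component. The joint distribution JOINT and the function names are hypothetical. Every cell has positive probability, so the chain is irreducible and the long-run frequencies of the visited states approach the target.

```python
import random

# Hypothetical joint distribution of two binary variables (all cells positive).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def draw_component(k, state, rng):
    """Draw component k conditioned on the current value of the other one."""
    s0, s1 = list(state), list(state)
    s0[k], s1[k] = 0, 1
    p1 = JOINT[tuple(s1)] / (JOINT[tuple(s0)] + JOINT[tuple(s1)])
    return 1 if rng.random() < p1 else 0

def gibbs(n_sweeps, rng, start=(0, 0)):
    """Run a systematic-scan Gibbs sampler; return empirical state frequencies."""
    state = list(start)
    counts = {key: 0 for key in JOINT}
    for _ in range(n_sweeps):
        for k in (0, 1):                 # visit the components in a fixed order
            state[k] = draw_component(k, state, rng)
        counts[tuple(state)] += 1
    return {key: c / n_sweeps for key, c in counts.items()}
```

Here criterion (I) is met trivially, since each conditional draw is a single Bernoulli trial. Criterion (II) depends on how strongly the two variables are tied together: the closer the off-diagonal cells are to zero, the more slowly the chain moves between (0, 0) and (1, 1).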


A natural way of applying Gibbs sampling to pedigrees
is what I will refer to as person by person Gibbs
sampling. In this case, each component of the partition
is the composite genotype of an individual. This partition
satisfies criterion (I) mainly because each person
is related to all the other people in the pedigree only
through his/her parents, spouse and children. This approach,
however, can run into serious trouble as far as
criterion (II) is concerned. The reason is that, in order
for the Gibbs sampler to work, the induced Markov
chain has to be irreducible, i.e. each point in the state
space can be reached from any other point through the
Markov chain. A sufficient condition for irreducibility is
the positivity condition introduced by Besag (1974). Basically,
positivity means that although the variables are
dependent in a probabilistic fashion, any joint configuration
of the variables is logically possible. Positivity
is clearly violated in our setting because given the genotypes
of the parents, some genotypes for the offspring are
logically eliminated. Because of this, for general genetic
problems, a naive application of person by person Gibbs
sampling can fail completely. While there are tricks (see
for example Sheehan and Thomas 1991) which can turn
a non-irreducible chain into an irreducible one, the efficiencies
of such approaches are still in question. More
investigation in this direction is warranted.
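
The failure mode can be seen in an abstract two-variable example; it is a hypothetical stand-in for a logical constraint between relatives, not a genetic model. Two binary variables are forced to be equal, so two of the four joint configurations have probability zero and positivity fails. Updating one variable at a time can never change the state, although the target distribution gives each of the two feasible states probability 1/2.

```python
import random

# Hypothetical target: the two variables must be equal, and each common
# value is equally likely. The zero cells violate the positivity condition.
JOINT = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

def single_site_gibbs(n_sweeps, rng, start=(0, 0)):
    """One-variable-at-a-time Gibbs sampler; returns the set of visited states."""
    state = list(start)
    visited = set()
    for _ in range(n_sweeps):
        for k in (0, 1):
            s0, s1 = list(state), list(state)
            s0[k], s1[k] = 0, 1
            p0, p1 = JOINT[tuple(s0)], JOINT[tuple(s1)]
            # The current state has positive probability, so p0 + p1 > 0.
            state[k] = 1 if rng.random() < p1 / (p0 + p1) else 0
        visited.add(tuple(state))
    return visited
```

Started at (0, 0), the sampler visits only (0, 0); started at (1, 1), only (1, 1). Drawing the two variables jointly as a single component would remove the problem, which is the motivation for choosing larger components in the sections that follow.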

4  Locus  by  Locus  Gibbs  Sampling

In this section, we consider a multiple loci problem
where peeling is infeasible. The task here is to simulate
the unobserved genotypes, jointly for all the loci
and all the individuals in the pedigree, conditioned on
the observed data and possibly some fixed values of the
parameters. Instead of doing person by person Gibbs
sampling, we propose an alternative method which will
be called locus by locus Gibbs sampling.

382  A. Kong

For each locus and each non-founder, we define two
identity by descent (IBD) variables, one on the mother's
side and one on the father's side. Each IBD variable is
binary and indicates whether the allele at a particular
locus is inherited from the grandfather or grandmother.
In a pedigree, the main reason that the genotypes at
different loci are dependent is that the IBD's are
correlated. In particular, for the same individual, two
IBD variables corresponding to a single parent and two
linked loci are positively correlated. As pointed out by
Lander and Green (1987) and Kong (1991), under the
assumption of no interference, these IBD variables form
a Markov chain. For example, suppose we have five ordered
loci A, B, C, D and E; then given the states of the
IBD variables associated with C, the IBD variables associated
with loci A and B are independent of those IBD
variables associated with loci D and E. However, even
if interference is allowed so that the IBD variables do
not form a Markov chain, the conditional distribution
of the IBD variables of a particular locus given the IBD
variables of the other loci can still be easily computed.
Noting this key fact, instead of simulating all the composite
genotypes jointly, we can construct a Gibbs sampling
scheme which simulates the genotypes and IBD
variables one locus at a time. When a locus is "visited",
we draw a sample of its genotypes and IBD variables,
jointly for all individuals, conditioned on the observed
data of that locus and the current imputed values of the
IBD variables of the other loci. Computationally, each
simulation step requires peeling. However, since only a
single locus is handled each time, peeling is usually possible.
As long as the loci are not right on top of each
other, it is trivial to show that locus by locus Gibbs sampling
always leads to an irreducible Markov chain. This
method is similar in spirit to one proposed by Lange and
Sobel (1990), but their approach has the limitation that
each locus must have only two alleles. The strength of
our approach is that it does not require the assumption
of no interference, a condition that is crucial for the
method proposed by Lander and Green (1987), and has
no restrictions on the number of alleles a marker may have.
The implementation of locus by locus Gibbs sampling is
currently underway.
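
The Markov property under no interference can be checked numerically for a single IBD indicator followed along five ordered loci. This is a sketch under stated assumptions: the indicator switches between adjacent loci with probability equal to the recombination fraction, it takes each value with probability 1/2 at the first locus, and the recombination fractions used are hypothetical. The conditional distribution at a locus given all the other loci then depends only on its two neighbours.

```python
from itertools import product

def step(x, y, theta):
    """P(indicator goes from x to y between adjacent loci), no interference."""
    return 1.0 - theta if x == y else theta

def conditional_middle(left, right, theta_left, theta_right):
    """[P(S = 0), P(S = 1)] at a locus given the indicators at its neighbours."""
    w = [step(left, s, theta_left) * step(s, right, theta_right)
         for s in (0, 1)]
    return [wi / sum(w) for wi in w]

def conditional_by_enumeration(values, thetas, j):
    """The same conditional for locus j, computed from the full joint."""
    def joint(s):
        p = 0.5                                   # first locus: either source
        for a, b, th in zip(s, s[1:], thetas):
            p *= step(a, b, th)
        return p
    w = []
    for v in (0, 1):
        s = list(values)
        s[j] = v
        w.append(joint(s))
    return [wi / sum(w) for wi in w]
```

With recombination fractions of 0.1 on both sides and both neighbours in state 0, `conditional_middle(0, 0, 0.1, 0.1)` puts probability 0.81/0.82 on state 0. In the locus by locus scheme, conditionals of this kind supply the information carried over from the other loci when a locus is visited.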

5  A  Model  with  Many  Parameters

As discussed earlier, for genetic models which have many
parameters, point by point evaluation of likelihoods can
be highly inefficient. Here we consider one model of this
type. Suppose we are doing a linkage analysis with the
gene locus of a quantitative trait and a single marker
locus. Apart from the genetic effect, suppose there is an
observed categorical covariate with three levels (0, 1, 2)
which may also have an effect on the observable trait.
Following Bonney et al (1988), we consider a regression
model for the quantitative trait:

$$z_i = \alpha + \beta G_i + \gamma_1 X_{i1} + \gamma_2 X_{i2} + \epsilon_i. \qquad (8)$$

Here z is the quantitative trait, i is the index for
individuals, $\alpha$ is the mean for individuals who do not
carry the disease allele and whose covariate belongs to
level 0, $\gamma_j$, j = 1, 2, is the difference between level j
and level 0, $X_{ij}$, j = 1, 2, is the indicator variable for
level j of the covariate, $\beta$ is the genetic effect, G is the
indicator for whether an individual carries the disease
allele and the $\epsilon_i$'s are assumed to be noise which are iid
$N(0, \sigma^2)$. Following the notation established in Section
2, $\theta$ denotes the recombination fraction between the gene
locus and the marker locus, and $\eta$ denotes the vector
$(\alpha, \beta, \gamma_1, \gamma_2, \sigma)$. Here $g_i$ denotes the composite genotype
which includes both the gene locus and marker locus.
The observed data $y_i$ will include $z_i$, the X's and the
single locus genotype of the marker. (For some individuals
in the pedigree, even $y_i$ may be missing.)

Instead of simply computing likelihoods, we set up a
Bayesian model where the parameters are also treated
as random variables. Standard conjugate priors (Box
and Tiao 1973) are assigned to the parameters. This
implies that if both the missing data g and the observed
data y are given, the complete data posterior distribution
$p(\theta, \eta \mid g, y)$ can be obtained in closed form and is
easy to draw from. With this setup, we do Gibbs sampling
by iterating between the parameters $(\theta, \eta)$ and the
unobserved genotype vector g. Starting with some initial
configuration g(0) which is compatible with y, we
draw a realization $(\theta(1), \eta(1))$ of the parameters from the
conditional distribution $p(\theta, \eta \mid g(0), y)$. We then draw a
realization g(1) conditioned on $(\theta(1), \eta(1))$. In general,
at time t, we draw a sample $(\theta(t), \eta(t))$ from the conditional
distribution

$$p(\theta, \eta \mid g(t-1), y) \qquad (9)$$

and then draw a sample g(t) from

$$p(g \mid y, \theta(t), \eta(t)). \qquad (10)$$

Drawing from (9) is simple because of the conjugate
setup. Drawing from (10) requires a modification of
peeling which was mentioned in Section 3. Based on
the theory described in Section 3, for t large, $(\theta(t), \eta(t))$
and g(t) can be considered as draws from the desired
conditional distributions

$$p(\theta, \eta \mid y) \qquad (11)$$

and

$$p(g \mid y), \qquad (12)$$

where (11) is the posterior distribution of the parameters
and (12) is the predictive distribution of the missing data
g. Because of that, histograms of the drawn parameter
values can be used as approximations of the posterior
distribution (11).
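
The alternation between (9) and (10) can be sketched on a heavily simplified stand-in for model (8); it is not the analysis reported in this section. In the sketch there is no pedigree and no marker, the carrier indicators G_i are independent with prior probability 1/2, the noise variance is 1, and the only unknown parameter is the genetic effect beta with a conjugate normal prior of mean 0 and variance TAU2. All of these choices and all function names are hypothetical. Because the individuals are unrelated, the draw of the missing data is a set of independent Bernoulli trials; in a pedigree that step is where peeling would be needed.

```python
import math
import random

TAU2 = 100.0   # hypothetical prior variance of beta (normal prior, mean 0)

def simulate(n, beta, rng):
    """Hypothetical data: z_i = beta * G_i + N(0, 1) noise, G_i unobserved."""
    z = []
    for _ in range(n):
        g = 1 if rng.random() < 0.5 else 0
        z.append(beta * g + rng.gauss(0.0, 1.0))
    return z

def draw_beta(g, z, rng):
    """Draw beta from its complete-data posterior p(beta | g, z), as in (9)."""
    prec = sum(g) + 1.0 / TAU2
    mean = sum(zi for gi, zi in zip(g, z) if gi) / prec
    return rng.gauss(mean, 1.0 / math.sqrt(prec))

def draw_g(beta, z, rng):
    """Draw every G_i from p(G_i | z_i, beta), the analogue of (10)."""
    g = []
    for zi in z:
        log_odds = -0.5 * (zi - beta) ** 2 + 0.5 * zi ** 2
        if log_odds >= 0:                     # numerically stable logistic
            p1 = 1.0 / (1.0 + math.exp(-log_odds))
        else:
            e = math.exp(log_odds)
            p1 = e / (1.0 + e)
        g.append(1 if rng.random() < p1 else 0)
    return g

def gibbs(z, n_iter, rng):
    """Iterate between the parameter and the missing data; return beta draws."""
    g = [1 if zi > 0 else 0 for zi in z]      # initial configuration g(0)
    betas = []
    for _ in range(n_iter):
        beta = draw_beta(g, z, rng)
        g = draw_g(beta, z, rng)
        betas.append(beta)
    return betas
```

A histogram of the draws returned by `gibbs`, after discarding an initial stretch, plays the role of the histogram approximation to the posterior distribution (11) in this toy setting.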

Using a particular pedigree structure (see Kong et al
1991 for details), we simulated one set of data based on
$\theta = 0.005$ and $(\alpha, \beta, \gamma_1, \gamma_2, \sigma) = (10, 1, 1, 2, 1)$. We then
applied the Gibbs sampler to the simulated data. In the
analysis, flat prior distributions are used for the parameters
$(\alpha, \beta, \gamma_1, \gamma_2, \sigma)$. For the prior distribution of the
recombination fraction $\theta$, we use a probabilistic mixture
of a delta function at $\theta = 1/2$ and a uniform distribution
between 0 and 1/2. The delta function at 1/2
has weight .957 to reflect the 1 : 22 prior odds against
the marker and disease gene being located on the same
chromosome. A total of 5000 iterations were run. The
histograms displayed in Figure 1 are constructed based
on the last 4000 samples of the drawn parameter values.
The posterior probability supporting linkage ($\theta < 1/2$)
is approximately .91. This is very substantial considering
the prior distribution we used for $\theta$. We note here
that, apart from posterior distributions, the Gibbs sampler
described here also gives excellent estimates of more
traditional inference tools such as lod scores. For details,
see Kong et al (1991).

Instead of using the histograms to approximate the
posterior distributions of the parameters, there is a better
alternative. Note that

$$p(\theta, \eta \mid y) = \sum_{g} p(\theta, \eta \mid g, y) \, p(g \mid y). \qquad (13)$$

This suggests, in our example, approximating $p(\theta, \eta \mid y)$
by

$$\frac{1}{4000} \sum_{t=1001}^{5000} p(\theta, \eta \mid g(t), y) \qquad (14)$$

assuming that each of the 4000 complete data posteriors
can be obtained in closed form. Expression (14) is called
the mixture approximation of the posterior distribution.
Liu et al (1991) have proved that the mixture approximation
is always superior to the histogram approximation
in the sense that it has smaller variance. The smooth
curve in Figure 1a is the mixture approximation. With
a little work, similar approximations can be obtained
for the other parameters.
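
The difference between the two approximations can be tried out on a toy target that has nothing to do with pedigrees: a standard bivariate normal distribution with hypothetical correlation RHO, sampled by a two-component Gibbs sampler. The mixture estimate of the marginal density of x averages the closed-form conditional densities given the drawn values of y, in the spirit of (14), while the histogram estimate counts the x draws that fall in a bin. Function names and the bin width are illustrative assumptions.

```python
import math
import random

RHO = 0.5   # hypothetical correlation of the standard bivariate normal target

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gibbs_draws(n_iter, rng):
    """Two-component Gibbs sampler; returns the successive (x, y) pairs."""
    x, y, out = 0.0, 0.0, []
    sd = math.sqrt(1.0 - RHO ** 2)
    for _ in range(n_iter):
        x = rng.gauss(RHO * y, sd)     # x given the current y
        y = rng.gauss(RHO * x, sd)     # y given the new x
        out.append((x, y))
    return out

def mixture_density(x0, draws):
    """Average of the conditional densities p(x0 | y_t) over the draws."""
    return sum(normal_pdf(x0, RHO * y, 1.0 - RHO ** 2)
               for _, y in draws) / len(draws)

def histogram_density(x0, draws, width=0.2):
    """Fraction of x draws in a bin centred at x0, divided by the bin width."""
    hits = sum(1 for x, _ in draws if abs(x - x0) <= width / 2.0)
    return hits / (len(draws) * width)
```

Both estimates target the true marginal density, which here is the standard normal density. Repeating the run with different seeds shows the mixture estimate varying much less from run to run than the histogram estimate, and it is smooth in x0 without any choice of bin width.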

6  Remarks 

Using two examples, we illustrated how the methods of
peeling and Gibbs sampling can be combined. Potentially
there are many problems where the same idea can
be applied. These include problems which have the characteristics
of both of our examples. In some cases, it may
make sense to combine locus by locus Gibbs sampling
with person by person Gibbs sampling. For example,
consider a linkage analysis where the genetic model relating
the phenotype to the genotype is so complicated
that the gene locus cannot be peeled. Here we probably
will still want to simulate the marker genotypes and
IBD's jointly using peeling. However, for the gene locus,
we may be forced to apply person by person Gibbs
sampling.

Computational  issues  aside,  there  is  also  the  question 
of  statistical  inference.  Bayesian  inference  provides  an 
alternative  to  traditional  inference  which  is  based  mainly 
on  profile  likelihoods  and  lod  scores.  The  relative  merits 
of these different approaches warrant further research.

Figure 1. Histograms of Simulated Parameter Values

Summary

Human  geneticists  have  been  extremely  successful 
in  the  past  decade  in  mapping  rare  disease  genes.  For 
common  diseases  with  a  substantial  genetic  component,  the 
payoff  for  human  health  is  larger,  but  the  mapping  problems 
are  harder.  There  is  a  need  for  robust  statistical  techniques 
that  require  minimal  assumptions  about  the  mode  of 
inheritance of the disease studied. The affected-pedigree-member
(APM) method of linkage analysis makes virtually no
assumptions about disease transmission except that it is
independent of marker transmission in a pedigree. We
discuss  here  an  extension  of  the  APM  method  from  single 
markers  to  multiple  closely  linked  markers.  This  extension 
should  improve  the  power  of  the  APM  method  to  detect 
linkage. 

Introduction 

Chromosomes  are  not  passed  intact  from  generation 
to  generation.  During  the  formation  of  gametes  (eggs  and 
sperm),  homologous  chromosomes  align  and  recombine. 
This  produces  gamete  chromosomes  that  alternate  between 
maternal  and  paternal  sources.  The  further  apart  two  loci 
are,  the  more  likely  it  is  that  recombination  will  occur 
between  them.  Recombination  has  the  effect  of  separating 
an  allele  at  one  locus  from  an  allele  at  the  second  locus.  The 
frequency  of  this  reshuffling,  whether  directly  observable  or 
not  in  the  offspring  of  a  parent,  is  termed  the  recombination 
fraction between the loci. Conventional linkage analysis
aims at estimating the recombination fraction. In order to
measure recombination between a disease locus and a neutral
marker locus, there must be an explicit model for the
phenotypic expression of the disease locus. Such a model
permits the inference of underlying disease genotypes from
observed disease phenotypes. For rare Mendelian diseases,
the model is usually clear, and linkage analysis has proved
to be an extremely powerful tool for mapping these diseases
(e.g., Kerem et al. 1989; Riordan et al. 1989; Rommens et
al. 1989).

§ This work supported in part by: the University of
California, Los Angeles; Harvard University; the University
of Pittsburgh; and USPHS grant CA 16042.

For more complex, common diseases such as
schizophrenia (Weeks et al. 1990), no one knows the correct
genetic model. This quandary is hardly resolved by selecting
a simple model inconsistent with the known pedigree data.
In fact, if the genetic model is misspecified, then this may
mask the evidence for linkage (Baron 1990; Clerget-Darpoux
et al. 1990; Martinez et al. 1989; Weeks et al. 1990a). For
example, Figure 1 displays a pedigree in which the
unaffected daughter is almost certainly a recombinant under
an incorrect model but more likely a nonrecombinant under
the correct model. Incorrect inferences about recombination
events are disastrous for lod scores.

In the current paper we present an alternative to
computing lod scores under questionable models. The APM
method, which uses all the affected individuals in a pedigree,
was preceded and inspired by the earlier affected-sib-pair
methods, which used only affected siblings (Day and Simons
1976; de Vries et al. 1976; Fishman et al. 1978; Green and
Woodrow 1977; Haseman and Elston 1972; Lange 1986a;
Lange 1986b; Lange and Weeks 1990; Penrose 1935; Suarez
and Hodge 1979; Suarez et al. 1978; Thomson and Bodmer
1977; Weeks and Lange 1988; Weeks and Lange 1991).
These methods are motivated by the fact that affected
individuals will tend to be alike at markers closely linked to
a disease locus. Marker similarity between unaffected
individuals is less predictable and is therefore ignored. The
APM method seeks to answer in a model-free way the
simple question: Are the affected individuals more similar
than expected by chance at the marker locus (or at a group of
markers closely linked to each other)? To answer this
question, we need a measure of marker similarity among the
affecteds of a pedigree and some idea of the statistical
distribution of the measure.

Figure 1: If a single-locus recessive model is postulated,
then at least one recombination event must be invoked to
explain the sib phenotypes. If a two-locus doubly recessive
model is postulated, then the sib phenotypes are consistent
with no recombination. Note that parental phases are, in
fact, unknown.

A similarity measure

We will first discuss how to define marker
similarity at a single marker locus. One natural measure of
marker similarity is simply to count the number of matches
of marker alleles between two affected individuals i and j,
letting each match contribute a fourth to the statistic $Z_{ij}$
(Table 1). In other words, we look at the identity-by-state
status of the marker alleles. Two alleles are identical-by-state
if they are the same allele, regardless of ancestral
origin. Note that our measure ignores the common sense
notion that, given two distantly related affecteds, a match
between two common alleles is less striking than a match
between two rare alleles. Thus, we modify our measure of
marker similarity so that each match in state contributes a
fourth times f(p), where f is a weighting function of the
marker allele frequency p. For example, if f(p) = 1/p and the
allele frequencies of allele A and allele B are 0.10 and 0.90,
respectively, then two individuals i and j who are both A/B
at the marker locus would have a similarity measure $Z_{ij}$ =
(1/4)(10) + (1/4)(1.11) = 2.78. Since the magnitude of this
similarity measure is meaningless by itself, we now turn
our attention to the calculation of the mean and variance of
$Z_{ij}$ under the null hypothesis of independent segregation of
disease and markers.

Table 1: Possible values of the unweighted similarity
statistic $Z_{ij}$ for a pair of affected individuals.

Marker Genotypes
A/A
A/A
A/B


Mean  of  the  similarity  measure 

First, let us rewrite the similarity measure $Z_{ij}$ as
the conditional expectation

$$Z_{ij} = E[\, U_{ij} \mid \text{marker genotypes of } i \text{ and } j \,]. \qquad (1)$$

To define the random variable $U_{ij}$ appearing on the right of
(1), imagine sampling random gametes from i and j. If the
two sampled gametes bear the same allele $a_r$ at the current
marker locus, then set $U_{ij} = f(p_r)$, where $p_r$ is the population
frequency of $a_r$. If the two alleles do not match in state, set
$U_{ij} = 0$. As a consequence of (1), $E(Z_{ij}) = E(U_{ij})$.

To better understand the random variable $U_{ij}$, it is
helpful to introduce the notion of identity-by-descent. Two
genes are identical-by-descent if one is a copy of the other or
they are both copies of the same ancestral gene. Identity-by-descent
obviously implies identity-by-state, but not
conversely. It is quite easy to calculate the probability that
a gene drawn at random from one person is identical-by-descent
to a gene at the same locus drawn at random from
another person. This probability, known as the kinship
coefficient, depends only on the relationship between the
two individuals i and j, i.e., on the graphical structure of the
pedigree connecting them.

The mean of $U_{ij}$ may be calculated by conditioning
on the identity-by-descent status of the alleles being
compared. If the alleles are identical-by-descent (with
probability $\Phi_{ij}$), then $U_{ij}$ takes on the value $f(p_r)$ with
probability $p_r$. If the alleles are not identical-by-descent
(with probability $1 - \Phi_{ij}$), then $U_{ij}$ takes on the value $f(p_r)$
with probability $p_r^2$, since the two alleles enter the pedigree
of i and j independently. These considerations lead to

$$E(Z_{ij}) = \Phi_{ij} \sum_r p_r f(p_r) + (1 - \Phi_{ij}) \sum_r p_r^2 f(p_r).$$
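
The similarity measure and its mean can be sketched in a few lines; this is an illustrative sketch, not the APM software. The weighting function f(p) = 1/p is the one implied by the arithmetic of the worked example, the allele frequencies are those of that example, and the function names are hypothetical. The last function checks the mean formula for two unrelated individuals (kinship coefficient zero) by summing over all genotype pairs under Hardy-Weinberg proportions, which is an added assumption.

```python
from itertools import product

def weight(p):
    """Hypothetical weighting function f(p) = 1/p."""
    return 1.0 / p

def similarity(geno_i, geno_j, freq, f=weight):
    """Z_ij: each allele match between i and j contributes f(p) / 4."""
    return sum(f(freq[a])
               for a, b in product(geno_i, geno_j) if a == b) / 4.0

def expected_similarity(kinship, freq, f=weight):
    """E(Z_ij) obtained by conditioning on identity-by-descent."""
    ibd = sum(p * f(p) for p in freq.values())
    not_ibd = sum(p * p * f(p) for p in freq.values())
    return kinship * ibd + (1.0 - kinship) * not_ibd

def expected_by_enumeration(freq, f=weight):
    """E(Z_ij) for two unrelated people, summing over ordered genotypes."""
    alleles = list(freq)
    total = 0.0
    for a, b, c, d in product(alleles, repeat=4):
        prob = freq[a] * freq[b] * freq[c] * freq[d]
        total += prob * similarity((a, b), (c, d), freq, f)
    return total
```

With `freq = {"A": 0.10, "B": 0.90}`, the call `similarity(("A", "B"), ("A", "B"), freq)` reproduces the value (1/4)(10) + (1/4)(1.11) of the worked example, and passing `f=lambda p: 1.0` gives the unweighted statistic.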

We define the similarity measure Z for a pedigree to
be the sum of the similarity measures $Z_{ij}$ between all
possible affected pairs. In other words,

$$Z = \sum_{i<j} Z_{ij}.$$

The expectation of Z is simply the sum of the expectations
of the $Z_{ij}$'s. To calculate the variance of Z, it is necessary
to compute terms such as $E(Z_{ij} Z_{kl})$, involving up to four
distinct individuals. Fortunately, $E(Z_{ij} Z_{kl})$ may be
calculated by a conditioning approach very similar to that
used to calculate the mean of $Z_{ij}$. Instead of conditioning on
whether two genes are identical-by-descent, we now
condition on the identity-by-descent relationships among the
genes drawn from the four individuals i, j, k, and l. The
probability of each of the 15 possible identity-by-descent
partitions of these four genes can be easily calculated using
the generalized kinship coefficients of Karigl (1981; 1982),
as extended by Weeks and Lange (1988). In short, it is
possible to calculate exactly the theoretical mean and
variance of the similarity measure Z for any pedigree. We
then standardize Z by subtracting off its mean, dividing by
its standard deviation, and weighting by $\sqrt{r - 1}$, where r is the
number of affected individuals in the pedigree:

$$W = \sqrt{r - 1} \; \frac{Z - E(Z)}{\sqrt{\mathrm{Var}(Z)}}.$$

For several pedigrees, we form a grand APM
statistic T with mean zero and variance 1 by dividing the
sum of the standardized measures $W_t$ from each pedigree t by
the appropriate sum of weights:

$$T = \frac{\sum_t W_t}{\sqrt{\sum_t (r_t - 1)}}.$$

The statistic T is asymptotically standard normal, provided
the number of pedigrees is large. A one-sided test based on
the observed T is appropriate since linkage acts to increase
marker sharing among affecteds.
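As a rough sketch of how the standardization and the one-sided test fit together, the snippet below assumes the weight is $\sqrt{r-1}$ per pedigree, so that the grand statistic has unit variance. The function names and the two-pedigree inputs are hypothetical; the upper-tail probability uses the standard normal approximation stated above.

```python
import math

def standardized_measure(z, mean_z, var_z, n_affected):
    """W = sqrt(r - 1) * (Z - E(Z)) / sqrt(Var(Z)) for one pedigree."""
    return math.sqrt(n_affected - 1) * (z - mean_z) / math.sqrt(var_z)

def grand_statistic(w_values, n_affected_values):
    """T = sum_t W_t / sqrt(sum_t (r_t - 1))."""
    return sum(w_values) / math.sqrt(sum(r - 1 for r in n_affected_values))

def one_sided_p_value(t):
    """Upper-tail probability of a standard normal variate."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

# Hypothetical pedigrees: (observed Z, E(Z), Var(Z), number of affecteds).
pedigrees = [(5.0, 3.0, 4.0, 5), (2.0, 2.0, 1.0, 2)]
w = [standardized_measure(z, m, v, r) for z, m, v, r in pedigrees]
t = grand_statistic(w, [r for _, _, _, r in pedigrees])
print(t, one_sided_p_value(t))
```

Applied to a statistic of 2.26211, `one_sided_p_value` returns roughly 0.0118, in line with the P-value reported for the intermediate weight in Table 2.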

Extension  to  multiple  marker  loci 

As  marker  maps  of  the  human  genome  become 
denser,  investigators  are  more  likely  to  type  disease 
pedigrees  at  several  closely  linked  marker  loci.  A  set  of 
closely  linked  markers  might,  collectively,  provide  a  more 
accurate  measure  of  marker  similarity  than  any  one  marker 
alone.  Thus,  we  have  extended  the  APM  method  to  use 
simultaneously  information  from  several  marker  loci  (Lange 
and  Weeks  1990).  For  clarity,  we  will  consider  only  two 
marker  loci  A  and  B  below.  The  results  easily  generalize  to 
several marker loci. The most obvious definition of
similarity at several marker loci is to take the sum of the
individual marker similarities. That is, for marker loci A
and B,

$$Z_{ij} = Z_{ij}^A + Z_{ij}^B.$$

For a pedigree, we then define

$$Z = \sum_{i<j} Z_{ij} = \sum_{i<j} Z_{ij}^A + \sum_{i<j} Z_{ij}^B = Z^A + Z^B.$$

The mean is easily computed as

$$E(Z) = E(Z^A) + E(Z^B),$$

but the variance poses more difficulties since

$$\mathrm{Var}(Z) = \mathrm{Var}(Z^A) + \mathrm{Var}(Z^B) + 2\,\mathrm{Cov}(Z^A, Z^B).$$

Because of the single locus results it suffices to compute

$$\mathrm{Cov}(Z^A, Z^B) = E(Z^A Z^B) - E(Z^A)\,E(Z^B).$$

Notice that since

$$E(Z^A Z^B) = E\Big[\Big(\sum_{i<j} Z_{ij}^A\Big)\Big(\sum_{k<m} Z_{km}^B\Big)\Big],$$

we must evaluate terms such as $E(Z_{ij}^A Z_{km}^B)$.

As before,

$$Z_{ij}^A = E(U_{ij}^A \mid \text{marker genotypes of } i \text{ and } j),$$

and similarly for $Z_{km}^B$ and $U_{km}^B$. Since $U_{ij}^A$ and $U_{km}^B$ are
conditionally independent given the marker genotypes of i, j, k, and m,

$$Z_{ij}^A Z_{km}^B = E(U_{ij}^A U_{km}^B \mid \text{marker genotypes of } i, j, k, \text{ and } m).$$

The only way in which $U_{ij}^A$ and $U_{km}^B$ can influence
one another is for identity-by-descent sharing at one locus to
increase the chance of identity-by-descent sharing at the other
locus. Thus, we can compute the expectation by
conditioning on the combined identity-by-descent states of
the i and j sampled gametes at locus A and the k and m
sampled gametes at locus B. The probabilities of the four
possible combined identity-by-descent states can be found
using the two-locus kinship coefficients of Thompson
(1988). For details, see Weeks and Lange (1991).


Application  to  Simulated  Data 

Using the simulation program SLINK (Ott 1989;
Weeks et al. 1990b), we simulated two markers flanking the
tuberous sclerosis disease locus, conditional on the structure
and affection status of the nine tuberous sclerosis pedigrees
from Janssen et al. (1990). The recombination fraction
between the left marker M1 (3 alleles, heterozygosity =
0.53) and the disease locus was taken as 0.01, and the
recombination fraction between the right marker M2 (4
alleles, heterozygosity = 0.65) and the disease locus was
taken as 0.02. While the maximum multipoint lod score
using marker data on the affecteds alone was only about
0.90, the APM method detected significant linkage (Table
2). When the intermediate weighting function $f(p) = 1/\sqrt{p}$
is used, the multilocus statistic is slightly more significant
than either of the single locus statistics. In the two
examples given in Weeks and Lange (1991), the superiority
of the multilocus statistic is much more evident.

In order to investigate the distribution of the
multilocus APM statistic under the null hypothesis of no
linkage, we simulated the segregation of the two markers
M1 and M2, independently of disease status. Assuming no
interference, the recombination fraction between the markers
is 0.0296. Table 3 summarizes results for the multilocus
APM statistics. As we observed previously with the single
locus statistic (Weeks and Lange 1988), the tails, skewness,
and kurtosis of the APM statistic increase as the influence
of allele frequency increases. Figure 2 provides a histogram
of the APM statistic for the intermediate weight
$f(p) = 1/\sqrt{p}$.
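The value 0.0296 follows from the two flanking recombination fractions once interference is ruled out: the markers recombine when exactly one of the two intervals does. A minimal check (the function name is an assumption made for illustration):

```python
def flanking_recombination_fraction(theta_left, theta_right):
    """Recombination fraction between two flanking markers under no interference.

    The markers are recombinant when an odd number of the two intervals
    (marker-disease, disease-marker) recombine, i.e. exactly one of them.
    """
    return theta_left * (1.0 - theta_right) + theta_right * (1.0 - theta_left)

# Values used in the simulation: 0.01 (M1 to disease) and 0.02 (disease to M2).
print(round(flanking_recombination_fraction(0.01, 0.02), 4))
```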


Table 2: Application of the single locus and multilocus
APM statistics to simulated tuberous sclerosis data; $\theta$ =
0.0296 between the flanking markers.

Function               M1         M2          Multilocus   (P-value)
$f(p) = 1$             2.99356    -0.01646    2.27595      (0.0114)
$f(p) = 1/\sqrt{p}$    2.22309     1.13318    2.26211      (0.0118)
$f(p) = 1/p$           0.22274     1.83509    1.62172      (0.0524)

Table 3: Simulation results for the multilocus test statistic,
based on 5,000 trials.

Function               Mean        (Variance)   Skewness^a   (Kurtosis^a)   Empirical P-value
$f(p) = 1$              0.01390    (0.99221)    0.17897      (-0.05648)     0.01620
$f(p) = 1/\sqrt{p}$     0.01232    (1.00957)    0.25233      (0.20503)      0.01840
$f(p) = 1/p$           -0.00577    (1.00060)    1.05786      (2.38970)      0.06280

^a For 5,000 trials, a skewness of 0.057 is significant at the
0.05 level, and a skewness of 0.081 is significant at the 0.01
level. A kurtosis of 0.114 is significant at the 0.05 level,
and a kurtosis of 0.161 is significant at the 0.01 level.

Function               Upper Fifth Percentile^b   Upper First Percentile^b
$f(p) = 1$             1.711                      2.448
$f(p) = 1/\sqrt{p}$    1.714                      2.622
$f(p) = 1/p$           1.754                      3.073

^b These are empirical percentiles. For a standard normal
variate, the theoretical upper fifth and first percentiles are
1.645 and 2.326.
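The significance thresholds quoted in footnote a of Table 3 appear to be consistent with the usual large-sample approximations for normal data, in which sample skewness has standard error $\sqrt{6/n}$ and excess kurtosis has standard error $\sqrt{24/n}$, combined with the one-sided normal quantiles 1.645 and 2.326. This reading is an inference from the numbers, not something the paper states. A short check under that assumption:

```python
import math

def moment_thresholds(n_trials, normal_quantile):
    """Approximate critical values for sample skewness and excess kurtosis.

    Assumes the large-sample normal-theory standard errors sqrt(6/n) and
    sqrt(24/n); this is an assumed reading of the footnote, not the paper's text.
    """
    se_skewness = math.sqrt(6.0 / n_trials)
    se_kurtosis = math.sqrt(24.0 / n_trials)
    return normal_quantile * se_skewness, normal_quantile * se_kurtosis

for level, quantile in [(0.05, 1.645), (0.01, 2.326)]:
    skew_crit, kurt_crit = moment_thresholds(5000, quantile)
    print(level, round(skew_crit, 3), round(kurt_crit, 3))
```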
[Figure 2 histogram; marked on the plot: value of observed statistic = 2.26]

Figure 2: Tuberous sclerosis data: histogram of the
multilocus test statistic, based on 5,000 trials, when
$f(p) = 1/\sqrt{p}$. The theoretical standard normal P-value is 0.0118,
while the empirical P-value is 0.01840.

Discussion 

The affected-pedigree-member (APM) method of
linkage analysis requires marker typing of only the affected
individuals in a pedigree. More importantly, the APM
method requires no assumptions about the mode of
inheritance of the disease. This may provide an advantage
over traditional methods of linkage analysis, which require
an explicit, and often unverifiable, disease model. Although
it is reasonable to expect the APM method to exhibit
severely reduced power relative to conventional linkage lod
score methods, in the presence of genetic heterogeneity it
may, in fact, perform better. The multiple locus extension
discussed here should make the APM method a more
versatile and powerful tool for genetic epidemiologists.
I. INTRODUCTION

In  1965,  A.W.F.  Edwards  and  L.L.  Cavalli-Sforza 
introduced  a  method  for  cluster  analysis  based  on  a  recursive 
partitioning  strategy  over  a  minimum-variance  clustering 
criterion.  Although  this  method  has  been  called  “intuitively 
appealing”,  it  was  dismissed  by  Gower  (1967)  and  others 
because  of  its  computational  infeasibility.  It  has  been 
suggested  on  numerous  occasions  that  some  computationally 
efficient  method  be  found  to  search  an  intelligently-chosen 
subset  of  the  set  of  all  possible  partitions  for  a  (hopefully) 
near-optimal  solution.  In  this  paper,  one  such  method  is 
introduced which borrows from the Classification and
Regression Trees (CART) classification paradigm of
Breiman, Friedman, Olshen and Stone (1984).

II. THE CLUSTERING ALGORITHM
1. Building the Clustering Tree

Consider a set S of n observations in p variables and
represent any arbitrary observation by the vector $x = (x_1, x_2,
\ldots, x_p)$. As a first step, we would like to partition this set of
n p-dimensional vectors into two subsets based solely on the
observations given. The method used by this algorithm is an
application of the standard split found in the CART
classification scheme. In order to define the splitting rules, we
generate all sets of the form

$$\{x_j < c\}, \quad j = 1, \ldots, p.$$

Geometrically, sets of this type are regions bounded by
(p-1)-dimensional hyperplanes parallel to the co-ordinate
axes which are selectively passed through p-space precisely
in the center (with respect to variable j) of every pair of data
points. Each of these hyperplanes will specify a partition of S
into two new sets, $S_1$ and $S_2$.

The values of c can be easily determined by sorting the jth
component of all the observations, then selecting values
halfway between each successive pair of $x_j$'s. In this fashion,
the algorithm is guaranteed to create the least number of
hyperplanes that will still produce every partition of S
possible using such a method (in fact, for any j there are at
most (n-1) of them).
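The construction of the cut values can be sketched in a few lines. This is only an illustration of the sort-and-take-midpoints step described above; the function name and the toy data are assumptions. Duplicate values are collapsed first, which is why there can be fewer than n-1 cut points for a given variable.

```python
def candidate_cutpoints(data, j):
    """Midpoints between successive distinct sorted values of variable j."""
    values = sorted({row[j] for row in data})
    return [(low + high) / 2.0 for low, high in zip(values, values[1:])]

# Hypothetical data set: three observations in two variables.
data = [(1.0, 5.0), (3.0, 5.0), (2.0, 9.0)]
print(candidate_cutpoints(data, 0))  # cut points for the first variable
print(candidate_cutpoints(data, 1))  # cut points for the second variable
```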

2. The “Goodness-of-Split” Criterion

Of  course,  some  measure  of  the  effectiveness  of  a  split  is 
needed in order to choose one of these p(n-1) (at most)
partitions  as  best.  As  in  the  original  Edwards  &  Cavalli- 
Sforza  algorithm,  the  minimum-variance  criterion  was 
selected.  Although  this  measure  can  be  somewhat  sensitive  to 
outliers,  it  has  been  tested  by  a  number  of  researchers  and 
shown  to  be  effective  in  a  wide  range  of  clustering  situations 
(Blashfield  1976,  Milligan  1983). 

More explicitly, from the standard set of splits generated
as above, the “optimal” split is chosen to be that which
maximizes the quantity

$$\mathrm{VAR}(S) - [\mathrm{VAR}(S_1) + \mathrm{VAR}(S_2)].$$

After the best split has been selected the algorithm
proceeds recursively, splitting $S_1$ and $S_2$, then the best splits
of these subsets, and so on.
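A minimal sketch of the split search follows. It assumes that VAR(S) denotes the sum of squared deviations of the points in S from their centroid, summed over all variables; the paper does not spell out the definition here, so that choice, along with the function names and toy data, is an assumption.

```python
def within_ss(points):
    """Sum of squared deviations from the centroid (assumed meaning of VAR)."""
    if not points:
        return 0.0
    n, p = len(points), len(points[0])
    means = [sum(x[j] for x in points) / n for j in range(p)]
    return sum((x[j] - means[j]) ** 2 for x in points for j in range(p))

def best_split(points):
    """Return (gain, variable index, cut value) maximizing the criterion."""
    total = within_ss(points)
    best = None
    for j in range(len(points[0])):
        values = sorted({x[j] for x in points})
        for low, high in zip(values, values[1:]):
            c = (low + high) / 2.0
            left = [x for x in points if x[j] < c]
            right = [x for x in points if x[j] >= c]
            gain = total - (within_ss(left) + within_ss(right))
            if best is None or gain > best[0]:
                best = (gain, j, c)
    return best  # None when no variable has two distinct values

# Hypothetical data: two tight groups separated along the first variable.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(best_split(points))
```

Applying `best_split` again to each of the two resulting subsets gives the recursive growth of the clustering tree.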

3.  Finding  the  “Optimal”  Subtree 

Certainly,  if  we  allow  the  partitioning  process  to  continue 
to  completion,  we  will  have  an  overspecification  of  the  data 
structure,  with  each  terminal  node  of  the  clustering  tree 
containing  a  very  few  data  points.  A  method  must  be  found, 
therefore,  to  select  the  subtree  of  this  complete  tree  that  most 
accurately  represents  the  gross  structural  characteristics  of 
the  data.  This  problem  is  exactly  the  “number  of  clusters” 
question  that  has  been  addressed  repeatedly  in  the  literature 
since  the  mid-1960s.  Although  no  general  analytic  method 
has  yet  been  found,  a  number  of  statistical  or  heuristic 
stopping  rules  have  been  employed  with  varying  degrees  of 
success  (see  Milligan  1985  for  a  comprehensive  review  and 
analysis). 

As  a  hierarchical  clustering  method,  recursive 
partitioning  is  amenable  to  the  application  of  many  of  these 
stopping  rules.  Following  a  thorough  search  of  the  literature 


Recursive  Partitioning  for  Cluster  Analysis  393 


and  testing  on  constructed  data  sets,  two  such  rules  were 
found  that  seem  to  perform  quite  well  in  tandem  with  the 
recursive  partitioning  algorithm.  These  stopping  rules  are  due 
to  Calinski  &  Harabasz  and  Duda  &  Hart. 

The approach of Calinski and Harabasz is to find that
clustering of the data which maximizes the Variance Ratio
Criterion (VRC)

$$\mathrm{VRC} = \frac{\mathrm{BGSS}}{k-1} \Big/ \frac{\mathrm{WGSS}}{n-k},$$


where  BGSS  and  WGSS  are  the  between-  and  within-cluster 
sums  of  squares,  n  is  the  number  of  data  points  in  the  set,  and 
k  is  the  number  of  clusters  in  the  current  partition.  This 
method  was  implemented  by  ordering  all  splits  in  the 
clustering  tree  according  to  the  splitting  criterion  and 
computing  the  VRC  for  all  subtrees  of  the  complete  tree 
created  by  recursively  pruning  away  the  lowest-rated 
remaining  split.  The  subtree  with  the  maximum  VRC  is  then 
selected  to  represent  the  optimal  clustering  of  the  data. 
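The criterion itself is straightforward to compute for any candidate partition. The sketch below evaluates the VRC for a given list of clusters; it does not reproduce the pruning order described above, and the function names and example data are assumptions.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(x[j] for x in points) / n for j in range(len(points[0]))]

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def variance_ratio_criterion(clusters):
    """VRC = [BGSS / (k - 1)] / [WGSS / (n - k)] for a list of clusters."""
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    if k < 2 or n <= k:
        raise ValueError("VRC needs at least two clusters and n > k")
    grand = centroid([x for c in clusters for x in c])
    wgss = 0.0
    bgss = 0.0
    for c in clusters:
        m = centroid(c)
        wgss += sum(squared_distance(x, m) for x in c)
        bgss += len(c) * squared_distance(m, grand)
    if wgss == 0.0:
        return math.inf  # perfectly tight clusters
    return (bgss / (k - 1)) / (wgss / (n - k))

# Hypothetical two-cluster partition of four points.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
print(variance_ratio_criterion(clusters))
```

The guard on k reflects the point made later in the text that the VRC cannot test the one-cluster hypothesis.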

The method of Duda and Hart is statistical in nature, and
is applied during the initial “growing” of the clustering tree.
The best split at each node (as defined in Section II.2) is
examined and the ratio

$$\frac{J_e(2)}{J_e(1)},$$

where $J_e(1)$ is the WGSS of the node prior to the split and
$J_e(2)$ is the WGSS over the pair of subsets resulting from the
application of the split, is used as a test of the null hypothesis
that the initial set of samples was drawn from a normal
population with mean $\mu$ and covariance matrix $\sigma^2 I$. A rough
estimate of the sampling distribution of the $J_e$'s may be
formulated, yielding the final test: Reject the null hypothesis
(i.e., split the node) at the p-percent level of significance if

$$\frac{J_e(2)}{J_e(1)} < 1 - \frac{2}{\pi d} - \alpha \sqrt{\frac{2\,\bigl(1 - 8/(\pi^2 d)\bigr)}{n d}},$$

where d is the dimensionality of the data, n is the number of
data points in the node, and $\alpha$ is determined as usual by

$$p = 100 \int_{\alpha}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du.$$

Of  course,  the  performance  of  the  Duda-Hart  rule  will  be 
directly  related  to  the  agreement  of  the  data  with  the 
underlying  assumptions  of  normality  and  form  of  the 
covariance  matrix,  and  therefore  some  care  should  be 
exercised when using it with data with a wildly asymmetric
or  otherwise  unusual  distribution. 
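The Duda-Hart rule can be sketched as follows, using the commonly cited form of the test in which a node is split when $J_e(2)/J_e(1)$ falls below $1 - 2/(\pi d) - \alpha\sqrt{2(1 - 8/(\pi^2 d))/(nd)}$. The function names and the numerical inputs are assumptions for illustration.

```python
import math

def duda_hart_critical_value(n, d, alpha):
    """Critical value for the ratio Je(2)/Je(1) with n points in d dimensions."""
    spread = 2.0 * (1.0 - 8.0 / (math.pi ** 2 * d)) / (n * d)
    return 1.0 - 2.0 / (math.pi * d) - alpha * math.sqrt(spread)

def should_split(je1, je2, n, d, alpha):
    """Reject the one-cluster hypothesis (split the node) for small ratios."""
    return je2 / je1 < duda_hart_critical_value(n, d, alpha)

def tail_percent(alpha):
    """p = 100 * P(standard normal > alpha)."""
    return 100.0 * 0.5 * math.erfc(alpha / math.sqrt(2.0))

# Hypothetical node: 50 points in 4 dimensions, alpha = 3.
print(duda_hart_critical_value(50, 4, 3.0))
print(should_split(je1=100.0, je2=40.0, n=50, d=4, alpha=3.0))
print(tail_percent(3.0))
```

With $\alpha = 3$ the upper-tail area is about 0.135 percent, the complement of the 99.865% figure quoted in the results on the number of clusters.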

These  two  stopping  rules  were  also  found  to  work  well  in 
concert  with  each  other.  Although  use  of  the  Duda-Hart  rule 
does require specification of the control parameter $\alpha$, this
requirement  makes  it  useful  for  interactive  examination  of  the 
stability  of  clustering  solutions.  In  addition,  the  Duda-Hart 
criterion  is  able  to  test  the  one-cluster  hypothesis  while  the 
Calinski-Harabasz  VRC  is  not. 

III.  ALGORITHM  PERFORMANCE 
1. Experimental Design and Data Generation

In  order  to  analyze  the  performance  of  this  algorithm  and 
compare it with other clustering methods, a series of Monte
Carlo tests was undertaken. The structure of these tests
followed  very  closely  that  used  by  Milligan  (1983,  1985). 
Data  sets  containing  50  points  each  were  generated  in 
accordance  with  a  structured  experimental  design.  The  design 
was  defined  by  three  factors:  number  of  clusters, 
dimensionality  of  the  data,  and  distribution  of  data  points 
across  clusters.  The  number  of  clusters  in  each  data  set  ranged 
from  two  to  five  and  each  set  was  embedded  in  either  four,  six, 
or  eight  dimensions.  In  order  to  ensure  testing  over  disparate 
cluster  sizes,  the  distribution  of  points  across  clusters  was 
varied  according  to  the  following  schemes  :  Type  A  -  points 
evenly  distributed  across  clusters;  Type  B  -  60%  of  points  in 
one  cluster;  or  Type  C  -  10%  of  points  in  one  cluster. 

The experimental design thus contained 36 cells, each of
which was replicated three times for a total of 108 data sets.
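The cell count follows directly from crossing the three factors. A small enumeration, with variable names chosen only for illustration:

```python
from itertools import product

cluster_counts = [2, 3, 4, 5]      # number of clusters
dimensions = [4, 6, 8]             # dimensionality of the data
distributions = ["A", "B", "C"]    # distribution of points across clusters
replications = 3

cells = list(product(cluster_counts, dimensions, distributions))
data_sets = [(k, d, t, rep) for (k, d, t) in cells
             for rep in range(1, replications + 1)]
print(len(cells), len(data_sets))
```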

The method of cluster generation was chosen to produce
the characteristics of internal cohesion and external isolation
noted (Milligan 1983, Everitt 1980) as being indicative of
natural cluster structure. In order to satisfy the internal
cohesion requirement, all clusters were composed of points
drawn from truncated (at 1.5 standard deviations)
multivariate normal distributions with standard deviations on
each dimension chosen randomly from the range (0.25, 2.00).
To ensure external isolation, clusters were not allowed to
overlap in the first dimension; in fact, the separation between
the means of two adjacent clusters in this dimension was
defined by a function of the form $u(\sigma_1 + \sigma_2)$, where $\sigma_1$ and
$\sigma_2$ were the standard deviations (in the first dimension) of the
two clusters, and u was a constant drawn from a uniform
(1.75, 2.25) distribution. The positions of cluster centers in the
remaining dimensions were selected randomly from within a
range 2/3 as large as that of the first dimension, so cluster
overlap was possible and frequently did occur in these
dimensions.

This method of cluster generation, when shaped by the
factors comprising the experimental design, produced a body
of data containing a wide variety of cluster shapes, sizes, and
relative orientations that was felt to be a fair test of an
algorithm’s ability to discern true cluster structure.
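The generation scheme can be approximated with the standard library alone. The sketch below is not the S-PLUS code used in the study. It assumes independent truncation in each dimension, reads the "range 2/3 as large" as two thirds of the spread of the first-dimension centers, and uses hypothetical function names and a hypothetical Type A allocation of 50 points to three clusters.

```python
import random

def truncated_normal(rng, mean, sd, limit=1.5):
    """Rejection-sample a normal deviate within `limit` standard deviations."""
    while True:
        x = rng.gauss(mean, sd)
        if abs(x - mean) <= limit * sd:
            return x

def generate_clusters(rng, sizes, dim):
    """Generate one data set as a list of clusters (lists of points)."""
    sds = [[rng.uniform(0.25, 2.0) for _ in range(dim)] for _ in sizes]
    # First dimension: adjacent means are u * (sigma_1 + sigma_2) apart.
    first_centers = [0.0]
    for prev, cur in zip(sds, sds[1:]):
        u = rng.uniform(1.75, 2.25)
        first_centers.append(first_centers[-1] + u * (prev[0] + cur[0]))
    # Remaining dimensions: centers drawn from a range 2/3 as wide (assumed reading).
    width = (2.0 / 3.0) * (first_centers[-1] - first_centers[0])
    clusters = []
    for size, sd, c0 in zip(sizes, sds, first_centers):
        center = [c0] + [rng.uniform(0.0, width) for _ in range(dim - 1)]
        clusters.append([[truncated_normal(rng, m, s) for m, s in zip(center, sd)]
                         for _ in range(size)])
    return clusters

clusters = generate_clusters(random.Random(1991), [17, 17, 16], dim=4)
print([len(c) for c in clusters])
```

Because points lie within 1.5 standard deviations of their mean and adjacent means are at least 1.75 times the summed standard deviations apart, clusters cannot overlap in the first dimension under these assumptions.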

2.  Results  I  -  Recovery  of  Cluster  Membership 

Each  of  the  108  data  sets  was  analyzed  for  cluster 
structure with five different algorithms: single linkage,
complete  linkage,  group  average,  k-means,  and  the  author’s 
implementation  of  the  algorithm  described  in  the  previous 
section  (henceforth  called  RPCLUS).  For  the  k-means 
method,  an  average  of  results  from  three  different  runs  was 
used  for  each  data  set,  with  a  random  ordering  of  the  data  for 
each  run.  The  output  of  each  of  the  methods  was  examined  at 
the  level  of  the  correct  number  of  clusters  and  this  output  was 
compared  to  the  (known)  true  structure  of  the  data  by  the  use 
of  the  corrected  Rand  statistic  (Rand  1971,  Milligan  1983). 
This  measure  of  similarity  is  defined  as 


XZNJ-  (ZZN2n2.)/n2 
+  (ZSNJ N^) /N^  * 


where $N_{ij}$ is the number of data points placed in cluster i by
the algorithm that are in cluster j of the actual solution, $N_{i\cdot}$ and
$N_{\cdot j}$ are the marginal totals and N is the total number of data
points.

The corrected Rand index assumes a value of 1.00 when
the two clusterings are in total agreement. Its lower bound
depends on the actual partition of the data, but is usually 0.00
or very slightly below. This index was chosen for reasons
made clear in other studies (Milligan 1983): its high
variability as compared to similar measures, as well as its
consistency across different cluster scenarios. On the advice
of such studies, a second index (Jaccard) was also used to
evaluate the clusterings, but as it was in complete agreement
with the corrected Rand statistic with regard to the relative
performance of the algorithms, it was not felt necessary to
include those values in the current report.
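For readers who want to experiment with a chance-corrected agreement measure, the sketch below implements the Hubert and Arabie adjusted Rand index, which is one widely used correction. It is swapped in here for illustration and may differ from the corrected Rand statistic used in this study; the function name and the label vectors are assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two labelings of the same points."""
    n = len(labels_a)
    cell_counts = Counter(zip(labels_a, labels_b))   # N_ij
    row_counts = Counter(labels_a)                   # N_i.
    col_counts = Counter(labels_b)                   # N_.j
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (maximum - expected)

truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]   # one point misclustered
print(adjusted_rand_index(truth, truth))
print(adjusted_rand_index(truth, found))
```

The index depends only on the contingency table of the two labelings, so relabeling the clusters leaves it unchanged.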

A  table  summarizing  the  results  of  the  complete  Monte 
Carlo  testing  with  respect  to  each  of  the  three  design  factors 
(number  of  clusters,  dimensionality,  and  point  distribution)  is 
available from the author. For the data in this study, the
complete linkage, RPCLUS and group average methods
were clearly in a separate class from single linkage and
k-means, with overall recovery means of 0.987, 0.986, 0.985,
0.955 and 0.909 respectively. There was strong similarity in
the recovery means for RPCLUS, complete linkage and
group average across almost all factors. The notable
exceptions were the five-cluster and Type C distribution
results, where RPCLUS paid the price for its minimum-variance
characteristic of seeking evenly-sized partitions.
However, even these differences were quite small; in fact, it
was often one misclustered point that accounted for the
discrepancies in the corrected Rand means.

3. Results II - Number of Clusters

Testing  was  also  undertaken  in  order  to  measure  the 
accuracy of the two stopping rules described in Section II.3
when used as constraints on the recursive partitioning
algorithm. The same 108 data sets were used and for each of
these sets a record was kept of the number of clusters
indicated by each of the two rules. Overall, the
Calinski-Harabasz rule exactly determined the number of clusters
present in 91 data sets (84.2%), while it was within one
cluster in either direction 105 times (97.2%). In the case of
the Duda-Hart rule, experimentation revealed that optimum
performance was obtained when $\alpha$ was set to 3.00
(corresponding to a 99.865% level of significance), and using
this  value  the  statistical  rule  produced  the  correct  number  of 
clusters  90  times  (83.3%)  and  was  within  one  cluster  in  103 
of  the  data  sets  (95.4%).  Clearly  both  rules,  when  used  in 
conjunction  with  the  recursive  partitioning  algorithm, 
provide  reliable  information  as  to  the  number  of  clusters 
present  in  data  with  true  cluster  structure. 

IV.  ADVANTAGES 

Monte  Carlo  testing  thus  has  shown  this  new  algorithm  to 
be  equivalent  in  performance  to  commonly-used  techniques 
such  as  complete-linkage  and  the  group  average  method. 
Why  then  should  we  be  interested  in  another  cluster  analysis 
tool?  Two  reasons  are  readily  apparent.  First,  cluster  analysis 
is  such  an  inexact  science  that  it  can  never  hurt  to  have  a 
number  of  different  approaches  to  use  when  beginning  the 
analysis of a set of data. This is important because each type
of algorithm is best suited to certain types of data. For
example, the linkage methods used above are not likely to be
effective for less spatially separated data, due to their
tendency to string together adjacent clusters. Also, being a
divisive method, RPCLUS would tend to yield different
results from the agglomerative algorithms, results that may be
more accurate at recognizing low-level cluster structure
because the algorithm has had fewer steps in which to make
irreversible mistakes. Another important consideration is that
RPCLUS lends itself to very efficient implementation. The
construction of standard splits boils down to a sorting
operation which can be done very cheaply, and in addition,
the algorithm seems to be a natural for parallel processing.
Other  hierarchical  clustering  methods  require  comparisons 
across  all  data  points  at  each  step.  RPCLUS,  due  to  its 
recursive  partitioning  strategy,  needs  only  to  keep  track  of  one 
portion  of  the  data  at  a  time  and  a  parallel  machine  can  readily 
farm  out  parts  of  the  work  to  subsets  of  its  computational 
resources. 

V.  SUMMARY 

A  new  algorithm  for  cluster  analysis  has  been  introduced 
which  draws  both  from  earlier  clustering  efforts  and  recent 
techniques  developed  for  use  in  classification  problems.  The 
algorithm makes good intuitive sense and has been shown in
Monte Carlo tests to perform as well as many other
methods used frequently in multivariate data analysis both for
the  recovery  of  cluster  membership  and  for  determining  the 
number  of  clusters  in  a  data  set.  Just  as  important,  the 
analyses  produced  by  this  method  are  representable  in  a 
simple  format  that  makes  understanding  of  data  structure 
easy  and  intuitive.  In  addition  to  its  value  as  an  alternative 
tool  for  the  data  analyst,  this  new  algorithm  also  possesses 
computational  advantages  over  some  other  popular  methods 
that  may  make  it  more  suitable  for  parallel  implementations 
on  very  large  data  sets. 

TECHNICAL  NOTE  :  All  data  sets  were  created 
using  random  number  generation  routines  contained  in  the 
S-PLUS  data  analysis  software  package  (Statistical  Sciences, 
Inc.  -  P.O.Box  85625  -  Seattle,  WA  98145).  The  complete 
linkage,  single  linkage  and  group  average  calculations  were 
performed  with  subroutines  also  found  in  the  S-PLUS 
package.  K-means  tests  were  run  using  software  developed 
under  DoD  contract  at  Los  Alamos  National  Laboratories. 
Recursive  partitioning  was  done  using  an  implementation 
developed  by  the  author. 
Abstract

Classification trees produced by a recursive partitioning
scheme such as CART are not guaranteed to be the best
tree structured classifiers possible, partly because the
sequential manner by which they are formed does not
allow for “looking ahead”. In some cases, altering trees
produced by CART by shifting the partition boundaries
results in improved prediction rules. Simulated annealing
can be used to efficiently search for trees which may
perform better than those produced by CART.

Introduction 

In the general classification problem, it is known that
each case in a sample belongs to one of a finite number
of possible classes, and given a set of measurements
for a case, it is desired to correctly predict to which
class the case belongs. A classifier is a rule which assigns
a predicted class membership based on a set of
related measurements, $x_1, x_2, \ldots, x_K$. Taking the measurement
space X to be the set of all possible values of
$(x_1, \ldots, x_K)$, and letting $C = \{c_1, c_2, \ldots, c_J\}$ be the set
of possible classes, a classifier is just a function with domain
X and range C. It is normally desirable to use past
experience as a basis for making new predictions, and so
it follows that classifiers are usually constructed from a
learning sample consisting of cases for which the correct
class membership is known in addition to the associated
values of $(x_1, \ldots, x_K)$.

Tree structured classifiers are constructed by making
repetitive splits of X and the subsequently created subsets
of X so that a hierarchical structure is formed, and
a plurality rule can be used to assign a predicted class
to each final subdivision of X. The CART method of
Breiman et al. [2] creates binary tree structured classifiers
by recursively partitioning the measurement space
X into disjoint subsets $A_1, A_2, \ldots, A_I$ as follows. The
measurement space X is first divided into two disjoint
sets by splitting it along a hyperplane. Next, one of the
sets obtained from the first split is cut with a second
hyperplane, resulting in the division of X into three disjoint


‘The author gratefully acknowledges support from NSF Grant
DMS-9002237. He would also like to thank Sarah Rosenblum, R.
Duane King, and Kelly J. Buchanan for their assistance.


sets. Successive splits can be similarly made until it is
deemed that any further partitioning of X could not possibly
result in a more accurate classifier. At each step,
the selection of the splitting hyperplane results from an
attempt to minimize overall tree impurity.

It is clear that after I-1 splits have been made, the
measurement space, as well as the learning sample, has
been separated into I disjoint sets. Letting $A_1, \ldots, A_I$
be the subsets making up the partition of X, these sets
can be used in the following way to construct a classifier.
If the value of $(x_1, \ldots, x_K)$ for an object to be classified
belongs to the set $A_i$, then the predicted class for this object
is the dominant class of the members of the learning
sample which belong to $A_i$. That is, once the measurement
space has been partitioned into sets $A_1, \ldots, A_I$, a
classifier can be created by using a simple plurality rule
to determine a mapping from $\{A_1, \ldots, A_I\}$ to C.
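The plurality rule can be sketched independently of how the partition was found. In the example below the partition is supplied as a function that maps a measurement vector to a set index; the two axis-parallel splits, the function names, and the learning sample are all hypothetical.

```python
from collections import Counter, defaultdict

def fit_plurality_rule(region_of, learning_sample):
    """Map each set of the partition to the dominant class among its members."""
    counts = defaultdict(Counter)
    for x, label in learning_sample:
        counts[region_of(x)][label] += 1
    return {region: tally.most_common(1)[0][0] for region, tally in counts.items()}

def classify(region_of, rule, x):
    """Predict the class of x from the set of the partition it falls in."""
    return rule[region_of(x)]

def region_of(x):
    # Hypothetical partition of the plane into three sets by two splits.
    if x[0] <= 2.0:
        return 1
    return 2 if x[1] <= 1.0 else 3

learning_sample = [
    ((1.0, 0.5), "a"), ((1.5, 3.0), "a"), ((0.5, 2.0), "b"),
    ((3.0, 0.2), "b"), ((4.0, 0.9), "b"),
    ((3.5, 2.0), "c"), ((5.0, 4.0), "c"), ((2.5, 1.5), "a"),
]
rule = fit_plurality_rule(region_of, learning_sample)
print(rule)
print(classify(region_of, rule, (0.1, 9.0)))
```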

One of the crucial issues that is addressed by the
CART method is the decision of how finely X should
be partitioned into a collection of disjoint subsets, since
too few or too many subsets will result in a loss of class
prediction accuracy. On one hand, if too small a tree is
chosen, not all of the information present in the learning
sample will be fully utilized. Thus, the misclassification
rate will be higher than the rate for a larger tree having
a finer partitioning of X. On the other hand, if a tree has
too many terminal nodes, it may be “paying too much
attention” to the specific features of the learning sample
and may not accurately reflect the structure of the larger
population from which the sample was taken. Although
the resubstitution misclassification rate decreases as the
complexity of the tree increases, it is not necessarily true
that the probability of misclassification becomes smaller.
For instance, it is possible that a tree with a large enough
number of nodes can have a resubstitution misclassification
rate of zero, but such a tree may do a very poor job
of predicting class membership for new observations.

CART  carefully  selects  a  good  value  for  the  number  of 
sets  in  the  partition  of  X  using  either  cross  validation  or 
the  test  sample  method.  To  do  this,  CART  first  “grows” 
a  tree  having  too  many  sets  in  the  partition  and  then 
successively  “prunes”  this  tree  by  recombining  subsets 
of  X  that  were  previously  split  until  the  right  sized  tree 
is obtained, using either cross validation or a test sample
(which  is  a  subset  of  the  original  learning  sample  that  is 
not  used  to  construct  the  tree,  but  is  solely  used  to  assess 
the  accuracy  of  the  various  candidates)  to  select  the  best 
classifier  from  the  sequence  of  candidates  encountered. 
In  other  words,  CART  is  a  stepwise  procedure  which 
initially  considers  all  of  the  variables  present  and  creates 
a  tree  which  is  too  complex.  Then,  the  choice  of  which 
variable  splits  will  be  used  in  the  final  tree  is  based  upon 
the  tree  encountered  in  the  pruning  process  which  has 
the  smallest  estimated  misclassification  rate. 

Even  though  the  CART  method  carefully  selects  the 
right  sized  tree,  the  classifier  obtained  isn’t  necessarily 
the  best  classification  tree  possible.  This  is  partly  due  to 
the  fact  that  CART  employs  a  “greedy”  algorithm  which 
prescribes  a  sequence  of  stepwise  optimal  moves  and  does 
not  allow  for  “looking  ahead”  in  order  to  examine  the  ef¬ 
fects  that  a  current  decision  will  have  on  the  ability  to 
create  subsequent  splits  leading  to  a  good  classification 
rule.  This  means  that  if  the  CART  method  yields  a 
classification  rule  based  on  a  partition  of  X  having  four 
sets,  then  that  rule  is  the  best  one  that  can  be  obtained 
by CART’s recursive partitioning and pruning scheme,
but it is not necessarily the globally optimal solution
if all ways of partitioning X into four sets are considered.
That is, CART’s reliance on a descent algorithm
employing  a  sequential  decision  process  which  doesn’t 
allow  for  “looking  ahead”  results  in  the  fact  that  the 
classification  rule  that  it  produces  may  not  be  the  best 
tree  structured  classifier  possible.  However,  it  should  be 
noted  that  if  the  CART  method  is  altered  to  allow  it 
to consider one or more subsequent moves at each step,
then  the  required  computation  time  would  be  vastly  in¬ 
creased,  making  this  way  of  altering  CART  to  search  for 
improved  tree  structured  rules  not  very  practical. 

A  procedure  that  could  use  CART’s  output  to  search 
for  an  improved  tree  structured  rule  would  be  to  let 
CART’s tree stipulate how many subsets the partition
of  X  should  include,  and  then  do  an  exhaustive  search 
for  the  best  classification  rule  amongst  all  sensible  par¬ 
titions  having  that  same  number  of  subsets.  However, 
unless  the  learning  sample  is  rather  small  and  simple 
there  can  be  far  too  many  competitors  to  examine,  and 
the  required  computation  time  could  be  excessive. 

Another  way  to  possibly  improve  a  classification  tree 
that  is  produced  by  a  recursive  method  such  as  CART  is 
to  shift  the  locations  of  the  partition  boundaries  while  re¬ 
taining  the  overall  nested  partitioned  structure  resulting 
from  the  recursive  partitioning  algorithm.  For  instance, 
if  CART  produces  a  tree  having  X  partitioned  into  four 
subsets,  then  one  could  search  for  a  better  tree  from 
among the class of four subset partitions of X that have
each split defined using the same variables that CART
used,  but  possibly  different  locations  for  the  cutting  hy¬ 
perplanes.  For  example,  if  the  first  decision  point  in  the 
CART  produced  tree  happens  to  be  “Is  X2  <  98.6?”, 
then only trees having an initial decision of the form “Is
X2 < γ?” (for some cut point γ) will be considered as possible candidates for
improvement.  Although  limiting  the  search  for  a  better 
tree  to  the  set  of  trees  having  the  same  general  struc¬ 
ture  as  the  CART  produced  tree  decreases  the  size  of 
the  candidate  pool  from  the  number  that  would  result 
if  all  partitions  having  the  same  number  of  sets  as  the 
CART  partition  were  considered,  for  large  data  sets  it 
may  still  be  infeasible  to  do  an  exhaustive  search  for  the 
best  such  partition  of  X. 

As  an  alternative  to  a  brute  force  search,  one  could 
begin  with  CART’s  tree  and  then  gradually  shift  the  lo¬ 
cations  of  the  partition  boundaries  in  order  to  search 
for  an  improved  tree.  One  way  to  do  this  would  be  to 
randomly  shift  the  locations  of  the  partition  boundaries 
and  then  determine  if  the  new  partitioning  of  X  reduces 
the  resubstitution  misclassification  rate.  If  so,  then  the 
associated  rule  can  be  adopted  as  the  best  rule  so  far. 
If  the  random  shifting  does  not  yield  improvement,  then 
the  new  partition  can  be  discarded  and  another  attempt 
at  shifting  the  boundaries  can  be  done  starting  from  the 
same  configuration  as  before.  Repeated  attempts  to  de¬ 
crease  the  resubstitution  misclassification  rate  by  shift¬ 
ing  the  partition  boundaries  belonging  to  the  best  rule 
so  far  may  result  in  iterative  improvement. 
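The accept-only-improvements search just described can be sketched as follows. This is a minimal illustration in Python, not the program used for the experiment: the names `iterative_improvement`, `energy`, and `perturb` are hypothetical, and the caller is assumed to supply one function that scores a configuration (for instance, its resubstitution misclassification rate) and one that randomly shifts its boundaries.

```python
import random


def iterative_improvement(initial, energy, perturb, n_tries, rng: random.Random):
    """Keep a random perturbation only when it lowers the energy.

    energy(config) returns the value to minimize; perturb(config, rng)
    returns a new trial configuration and must not modify config.
    """
    best, best_e = initial, energy(initial)
    for _ in range(n_tries):
        trial = perturb(best, rng)   # always start from the best rule so far
        trial_e = energy(trial)
        if trial_e < best_e:         # improvement: adopt the trial
            best, best_e = trial, trial_e
        # otherwise the trial is discarded and `best` is perturbed again
    return best, best_e
```

Because a trial is kept only when it is strictly better, the returned energy can never exceed the starting energy, but the search has no way of leaving a local minimum.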

Using  Simulated  Annealing 

An  undesirable  feature  of  the  preceding  scheme  is  that 
it  may  yield  a  solution  which  is  locally  optimal  without 
being  globally  optimal  as  well.  That  is,  the  overall  best 
partition  (having  the  same  general  form  as  the  CART 
produced  partition)  may  be  separated  from  the  initial 
CART  partition  by  a  “ridge”,  and  it  may  not  be  possible 
to  reach  the  globally  optimal  tree  from  the  initial  tree  if 
only  downhill  moves  are  allowed. 

Simulated  annealing  is  an  iterative  method  of  opti¬ 
mization  which  makes  use  of  a  random  number  genera¬ 
tor.  If  the  method  of  simulated  annealing  is  combined 
with  the  idea  of  randomly  shifting  the  partition  bound¬ 
aries  to  search  for  a  better  tree,  then  it  may  be  possible 
to  avoid  getting  stuck  at  a  solution  which  is  only  locally 
optimal. This is due to the fact that simulated annealing
allows for an occasional uphill move. It is hoped that
if the parameters associated with a simulated annealing
algorithm are properly selected, then successive random
perturbations of the partition boundaries will eventually
result  in  a  classification  tree  which  may  perform  better 
than  the  CART  tree. 

The  general  scheme  for  the  basic  version  of  simulated 
annealing employed for this project can be described in
the  following  way.  At  the  start  of  each  new  iteration,  the 
current  configuration  of  a  system  is  slightly  altered  in  a 
random  way  to  obtain  a  neighboring  trial  configuration. 
If  the  trial  configuration  is  better  than  the  current  con¬ 
figuration  (the  configuration  that  existed  at  the  start  of 
the  iteration  step  prior  to  the  random  altering),  then  the 
trial  configuration  is  automatically  accepted  and  taken 
to  be  the  current  configuration  for  the  next  step.  How¬ 
ever,  if  the  trial  configuration  is  not  better  than  the  cur¬ 
rent  configuration  for  the  iteration  step,  then  the  trial 
configuration  is  neither  automatically  accepted  nor  re¬ 
jected  as  being  the  new  current  configuration  for  the 
next  iteration  step.  Instead,  a  pseudo  random  number 
generator  will  be  used  to  randomly  decide  whether  or  not 
to  accept  the  uphill  move.  The  probability  with  which  a 
less favorable configuration is accepted depends upon
two factors: the degree to which the trial configuration
is worse and the position of the iteration step in
the  whole  sequence  of  steps  which  make  up  the  anneal¬ 
ing  process.  The  more  unfavorable  a  trial  configuration 
is,  the  less  likely  it  is  to  be  accepted.  Also,  a  less  fa¬ 
vorable  configuration  is  more  likely  to  be  accepted  in  an 
uphill  move  if  it  occurs  near  the  beginning  of  the  anneal¬ 
ing  process  than  it  would  if  it  occurs  after  the  process 
has  been  run  through  many  iterations.  If  an  uphill  move 
is  rejected,  then  the  current  configuration  remains  the 
same  for  the  next  iteration. 

The  simulated  annealing  approach  was  originally  de¬ 
veloped  by  Metropolis  et  al.  [5]  as  a  way  of  minimizing 
complex  energy  functions  associated  with  N  particle  sys¬ 
tems,  and  both  the  general  scheme  and  the  associated 
terminology  are  related  to  statistical  mechanics  and  the 
behavior  of  N  particle  systems  which  are  acted  upon  by 
a  heat  bath.  Although  the  approach  has  been  success¬ 
fully  applied  to  many  problems  having  little  to  do  with 
physics  (see  [1,  3,  1]),  it  is  common  practice  to  retain 
the  physicists’  terminology  of  energy  and  temperature. 
Basically,  the  energy  E  is  some  function  of  an  N  particle 
system that one desires to minimize, and the temperature
T is a control parameter that affects the probability
$e^{-\Delta E/T}$ of an uphill move being accepted. Here $\Delta E$ is
the increase in the energy function associated with the
uphill shift under consideration.
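The acceptance rule can be written as a small function. The sketch below is illustrative only; the name `accept_move` is hypothetical, and the temperature is assumed to be strictly positive.

```python
import math
import random


def accept_move(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Decide whether to accept a trial configuration.

    delta_e is the change in energy (trial minus current); temperature
    must be positive.
    """
    if delta_e < 0:
        return True                  # downhill moves are always accepted
    # Uphill moves are accepted with probability exp(-delta_e / T).
    # For delta_e == 0 this probability is 1, so the move is accepted.
    return rng.random() < math.exp(-delta_e / temperature)
```

A larger increase in energy or a lower temperature both shrink the exponential, which is how the two factors named above enter the decision.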

It  is  common  practice  to  lower  the  temperature  (thus 
decreasing  the  probabilities  of  accepting  uphill  moves) 
as  the  annealing  process  continues.  As  typically  imple¬ 
mented,  this  “cooling”  is  carried  out  using  two  additional 
parameters,  the  temperature  length  L  and  the  cooling 
ratio  r,  along  with  the  temperature  T.  Here  L  is  a  fixed 
positive  integer,  r  is  a  fixed  real  number  belonging  to 
(0, 1), and T is a variable parameter which takes on a
decreasing sequence of positive real values $T_1, T_2, T_3, \ldots$.
The temperature T is kept constant while a sequence
of L trial configurations are considered. Then T is decreased
by multiplying its value by the cooling ratio r
($T_{i+1} = r T_i$), and L additional iterations are done with
the  resulting  lower  temperature.  This  procedure  contin¬ 
ues  until  it  is  deemed  that  significant  further  improve¬ 
ment  is  unlikely,  at  which  point  the  process  is  terminated 
according  to  an  appropriate  stopping  rule.  It  is  hoped 
that  when  the  process  reaches  the  “frozen  state”  (the 
point  at  which  the  temperature  is  not  lowered  and  no 
additional  shifts  are  tried),  the  current  solution  is  close 
to  being  globally  optimal. 
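The geometric cooling schedule can be sketched in a few lines. The function name `temperature_levels` and the argument `n_levels` are hypothetical; in the procedure described above the number of levels is not fixed in advance but is determined by the stopping rule.

```python
def temperature_levels(t1: float, r: float, n_levels: int) -> list:
    """Return the first n_levels temperatures T_1, r*T_1, r^2*T_1, ..."""
    levels = []
    t = t1
    for _ in range(n_levels):
        levels.append(t)   # L trial configurations would be tried at this T
        t *= r             # T_{i+1} = r * T_i, with 0 < r < 1
    return levels
```

With a cooling ratio close to 1 the temperature falls slowly; with a small ratio it collapses within a few levels.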

In  summary,  the  temperature  is  a  control  parame¬ 
ter  which  determines  the  likelihood  of  uphill  moves  be¬ 
ing  accepted.  At  the  start  of  the  annealing  process, 
the  temperature  is  high  so  that  hopefully  enough  uphill 
moves  will  be  made  so  that  the  configuration  will  not  be 
trapped  at  a  local  minimum.  As  the  annealing  process 
continues,  the  temperature  is  lowered  so  that  the  series 
of  trial  configurations  can  move  more  efficiently  towards 
a  final  configuration  having  low  energy.  In  many  appli¬ 
cations  of  the  simulated  annealing  algorithm,  key  issues 
are the determination of good choices for the initial temperature
$T_1$, the temperature length L, and the cooling
ratio  r,  along  with  developing  a  good  method  with  which 
to  randomly  shift  to  new  trial  configurations. 

In  the  classification  tree  problem,  the  members  of  the 
learning  sample  can  play  the  role  of  the  N  particles,  and 
E  can  be  the  resubstitution  estimate  of  the  misclassifi- 
cation  rate,  which  is  just  the  proportion  of  the  learning 
sample  that  would  be  misclassified  by  the  classification 
tree. It should be noted that while the resubstitution estimate
of the misclassification rate is not a good criterion
to  use  in  the  selection  of  the  right  sized  tree,  there  seems 
to  be  nothing  wrong  with  minimizing  the  resubstitution 
estimate  of  the  misclassification  rate  when  an  attempt  is 
made  to  find  the  best  tree  structured  classification  rule 
from  among  all  those  having  the  same  number  of  subsets 
in  the  partition  of  X  and  the  same  nested  structure  of 
partition  subsets. 
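To make the energy function concrete, the following sketch computes the resubstitution misclassification rate of a small tree. The tuple representation of a tree and the names `classify` and `resubstitution_rate` are assumptions made for illustration, as is the convention that a point goes to the left branch when its coordinate does not exceed the threshold.

```python
def classify(tree, x):
    """Drop x down the tree and return the class label of its terminal node.

    A tree is either a class label (terminal node) or a tuple
    (variable_index, threshold, left_subtree, right_subtree).
    Class labels are assumed not to be tuples.
    """
    while isinstance(tree, tuple):
        j, threshold, left, right = tree
        tree = left if x[j] <= threshold else right
    return tree


def resubstitution_rate(tree, sample):
    """Proportion of (x, label) pairs in the learning sample misclassified."""
    wrong = sum(1 for x, label in sample if classify(tree, x) != label)
    return wrong / len(sample)
```

Shifting a boundary corresponds to changing one `threshold` while leaving every `variable_index` and the nesting of the tuples unchanged.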

Description  of  the  Experiment 

A  waveform  recognition  problem  due  to  Breiman  et  al. 
[2] is employed as a test bed for the improvement scheme.
The data was constructed using random number generators
so that a sufficiently large number of independent
observations  could  be  made  available  for  a  satisfactory 
assessment  of  the  performance  of  the  simulated  anneal¬ 
ing  technique. 

The data points used are 21-dimensional vectors of the
form $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,21})$. Each data point consists
of a random convex combination of two fixed waveforms,
to which Gaussian noise has been added. The
fixed  waveforms  used  were  selected  from  a,  b,  and  c, 
where 

a = (0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),

b = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0),

and

c = (0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0, 0, 0, 0, 0).

Data points belonging to Class 1 are of the form
$x_i = u_i a + (1 - u_i) b + \epsilon_i$, where $u_i$ is a uniform (0,1)
random deviate and $\epsilon_i = (\epsilon_{i,1}, \epsilon_{i,2}, \ldots, \epsilon_{i,21})$ is a vector
of observed values for 21 i.i.d. Gaussian random variables.
Similarly, data points belonging to Class 2 are of
the form $x_i = u_i a + (1 - u_i) c + \epsilon_i$, and members of Class
3 are of the form $x_i = u_i b + (1 - u_i) c + \epsilon_i$.
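A data set of this kind can be generated with a short program. The sketch below is an illustration rather than the generator actually used: the function names are hypothetical, classes are drawn with equal probability, and the noise standard deviation `sigma` defaults to 1.0, which is an assumption because the value is not stated here.

```python
import random

A = [0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
B = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0]
C = [0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0, 0, 0, 0, 0]

# Class 1 mixes a and b, Class 2 mixes a and c, Class 3 mixes b and c.
PAIRS = {1: (A, B), 2: (A, C), 3: (B, C)}


def make_waveform(cls, rng, sigma=1.0):
    """Return one 21-dimensional observation from the given class."""
    first, second = PAIRS[cls]
    u = rng.random()                       # uniform mixing weight
    return [u * f + (1.0 - u) * s + rng.gauss(0.0, sigma)
            for f, s in zip(first, second)]


def make_data_set(n, rng, sigma=1.0):
    """Return n (observation, class) pairs with equally likely classes."""
    data = []
    for _ in range(n):
        cls = rng.choice([1, 2, 3])
        data.append((make_waveform(cls, rng, sigma), cls))
    return data
```

For example, `make_data_set(300, random.Random(0))` produces one learning sample of the size used in the experiment.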

The  task  of  identifying  the  proper  class  associated 
with  one  of  the  random  vectors  described  above  is  made 
difficult for two primary reasons. First, whenever $u_i$
is close to either 0 or 1, it is not easy to distinguish between
two possible classes. For instance, a Class 1 vector
with $u_i$ close to 0 may look very much like a Class 3 observation
having $u_i$ close to 1; the general shape of
both will resemble waveform b. The identification of the
correct  class  is  further  hindered  by  the  additive  Gaussian 
noise.  It  should  be  noted  that  the  standard  deviation  as¬ 
sociated  with  the  noise  is  rather  large  compared  to  the 
average  magnitude  of  the  waveform  coordinates. 

A  total  of  twenty  data  sets  consisting  of  300  waveforms 
each  were  generated.  The  class  for  each  observation 
was  randomly  selected,  with  the  three  possible  classes 
all  having  the  same  likelihood  of  being  chosen.  CART 
was then applied to construct tree structured classifiers.
Both  the  Gini  and  the  twoing  splitting  criteria  were  used 
with  each  data  set,  and  so  altogether  CART  was  used 
to  produce  forty  classification  trees.  In  all  cases,  linear 
combination  splits  using  more  than  one  variable  were  dis¬ 
allowed.  By  using  only  single  variable  splits,  all  of  the 
decisions  associated  with  the  tree  structured  rules  are  of 
the form “Is $x_{i,j} \le \gamma$?”. This restriction on tree formation
greatly simplified the programming of the annealing
algorithm,  and  also  led  to  decreased  running  times. 

The  number  of  nodes  in  the  classification  trees  pro¬ 
duced  by  CART  ranged  from  3  to  17.  Since  the  level 
of  difficulty  of  writing  a  program  to  perform  the  simu¬ 
lated  annealing  improvement  scheme  increases  with  the 
complexity  of  the  tree,  it  was  decided  to  only  investigate 
the  performance  of  the  method  on  tree  structured  rules 
possessing  three,  four,  or  five  nodes. 

Of  the  fifteen  trees  having  five  or  fewer  nodes,  some 
were identical, and there were only thirteen distinct trees
to be examined. For each of the thirteen classification
trees considered, numerous variations of the simulated
annealing technique were investigated. One source of
variation was due to using different values for the cooling
schedule control parameters $T_1$, L, and r. Values
tried for $T_1$ were 0.000625, 0.00125, 0.0025, and 0.005.
The  cooling  ratio  r  was  assigned  the  values  0.25,  0.5, 
0.75,  and  0.875,  and  100,  150,  and  200  were  used  for  the 
temperature  length  L. 

Various  methods  for  obtaining  trial  configurations 
were  also  considered.  One  way  to  produce  a  trial  con¬ 
figuration  from  the  current  one  is  to  shift  only  a  single 
boundary  of  the  partition.  Alternatively,  more  than  one 
partition  boundary  could  be  shifted  to  create  a  new  tree. 
Both  the  single  shift  approach  and  the  multiple  shifts  ap¬ 
proach  were  investigated.  With  the  single  shift  method, 
a  boundary  is  randomly  selected  to  be  moved,  and  each 
time  a  shift  is  to  be  made,  each  boundary  in  the  tree  is 
given  an  equal  chance  of  being  selected.  Alternatively, 
for  every  perturbation  in  the  multiple  shifts  scheme,  each 
boundary  is  shifted  with  probability  0.5,  independent  of 
what  occurs  with  the  other  boundaries.  If  no  boundaries 
are  selected  for  movement,  additional  attempts  are  made 
until  at  least  one  boundary  shift  occurs. 
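The two ways of choosing which boundaries to move can be sketched as below. The function name `choose_boundaries` and the indexing of boundaries by integers 0, 1, ... are assumptions made for illustration.

```python
import random


def choose_boundaries(n_boundaries: int, multiple: bool, rng: random.Random) -> list:
    """Return the indices of the partition boundaries to shift."""
    if not multiple:
        # Single shift: every boundary has an equal chance of selection.
        return [rng.randrange(n_boundaries)]
    while True:
        # Multiple shifts: each boundary is selected independently
        # with probability 0.5.
        chosen = [b for b in range(n_boundaries) if rng.random() < 0.5]
        if chosen:            # retry until at least one boundary is selected
            return chosen
```

Under the multiple shifts scheme a tree with four boundaries moves two of them on average per perturbation, so each perturbation disturbs the partition more than a single shift does.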

Another  issue  connected  with  the  shifting  procedure 
concerns  the  magnitude  of  the  shifts.  If  it  is  decided 
that  a  boundary  will  be  shifted,  the  shift  could  be  slight 
so  that  only  a  small  number  of  data  points  fall  into  a 
node  different  from  the  one  that  they  were  in  previously, 
or  the  size  of  the  shift  could  be  much  larger  so  that  an 
appreciable  proportion  of  the  data  points  change  their 
node  membership. 

In  the  algorithm  used  to  perform  the  annealing,  a  pa¬ 
rameter  K  is  used  to  specify  the  greatest  number  of  data 
points  that  can  change  node  membership  when  a  bound¬ 
ary  shift  is  performed.  When  a  boundary  is  selected  to 
be  shifted,  the  direction  of  the  shift  is  first  determined 
and then the number of points to change node membership
is randomly chosen from $\{1, 2, \ldots, K\}$. Values tried
for  K  were  5,  10,  20,  and  30. 
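One way to realize a shift whose size is measured in data points is sketched below for a single split. Several details are assumptions rather than facts from the text: the direction is chosen by a fair coin, the boundary is placed midway between two consecutive sorted values, the shift is clamped so that both sides of the split stay nonempty, and the name `shift_boundary` is hypothetical. The sketch also ignores the effect that moving a boundary high in the tree has on the nodes beneath it.

```python
import random


def shift_boundary(sorted_values, pos, max_points, rng: random.Random):
    """Move a split so that up to max_points points change sides.

    sorted_values: the split variable's values for the points in the node,
        in increasing order (at least two values).
    pos: number of points currently on the low side of the boundary.
    Returns the new position and the corresponding threshold.
    """
    n = len(sorted_values)
    direction = rng.choice([-1, 1])       # assumed: either direction equally likely
    k = rng.randint(1, max_points)        # uniform on {1, ..., K}
    new_pos = min(max(pos + direction * k, 1), n - 1)   # keep both sides nonempty
    threshold = 0.5 * (sorted_values[new_pos - 1] + sorted_values[new_pos])
    return new_pos, threshold
```

When the values contain ties, or when the clamp takes effect, the number of points that actually change sides can differ from the `k` that was drawn.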

Another  option  that  was  explored  deals  with  the  ini¬ 
tial  configuration  for  the  annealing  process.  The  use  of 
CART’s tree as a starting point was investigated, as was
using  a  randomly  selected  configuration  having  the  same 
general  structure  as  the  CART  partitioning. 

A  very  basic  version  of  the  general  simulated  anneal¬ 
ing  algorithm  was  used  in  order  to  attempt  to  produce 
more  accurate  classification  rules.  It  was  hoped  that 
shifting the boundary locations to configurations having
lower resubstitution misclassification rates would
produce tree structured rules for which the misclassification
rates would be lower when new, independent
data was applied, and this led to using the resubstitution
misclassification rate as the energy function to
be  minimized.  That  is,  the  energy  was  taken  to  be 
E  =  (no.  obs.  misclassified)/300. 
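To give a sense of scale, consider a trial configuration that misclassifies exactly one more observation than the current one. The worked values below use the largest and smallest initial temperatures tried in the experiment together with the acceptance probability $e^{-\Delta E/T}$; they are an illustration, not results reported by the study.

```latex
\Delta E = \frac{1}{300} \approx 0.00333, \qquad
\left. e^{-\Delta E/T} \right|_{T = 0.005} = e^{-0.667} \approx 0.51, \qquad
\left. e^{-\Delta E/T} \right|_{T = 0.000625} = e^{-5.33} \approx 0.0048 .
```

So at the highest starting temperature about half of such one-observation uphill moves would be accepted, while at the lowest starting temperature fewer than one in two hundred would be.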

A  geometric  cooling  schedule  having  fixed  length  was 
employed.  This  means  that  the  temperature  levels  in  the 
sequence $T_1, T_2, T_3, \ldots$ were related through the equality
$T_{i+1} = r T_i$, and a fixed number, L, of trial configurations
were considered at each temperature level in the
sequence. The stopping rule utilized was as follows: the
annealing  process  was  terminated  whenever  none  of  the 
L  trial  configurations  generated  at  a  given  temperature 
level  resulted  in  an  accepted  shift  and  a  new  value  of  E. 

The  following  description  is  a  short  summary  of  the 
algorithm.  At  each  temperature  level  encountered,  L 
trial  configurations  are  generated  by  shifting  one  or  more 
partition  boundaries.  For  each  trial  configuration,  the 
energy  E  (which  is  just  the  resubstitution  misclassifica¬ 
tion  rate)  is  determined.  If  E  is  reduced,  the  shift  is 
accepted  and  another  trial  configuration  is  obtained  by 
jiggling  the  boundaries  of  this  newly  accepted  partition. 
If  E  is  not  lower  for  a  trial  configuration,  then  the  con¬ 
figuration  is  accepted  in  an  uphill  move  with  probability 
$e^{-\Delta E/T}$ and rejected otherwise. Here $\Delta E$ is the increase
in  energy  associated  with  the  trial  configuration,  and  T 
is  the  current  temperature.  If  a  trial  configuration  is  not 
accepted,  then  the  next  trial  configuration  is  obtained 
starting from the same tree that was altered previously,
and  a  new  random  perturbation  of  the  boundaries  is  pro¬ 
duced.  If  one  or  more  of  the  trial  configurations  at  a 
given  temperature  T  results  in  an  accepted  shift  with  a 
change  in  E,  then  the  temperature  T  is  lowered  to  rT, 
and  L  additional  trial  configurations  are  produced.  Oth¬ 
erwise,  the  process  is  terminated  and  the  current  config¬ 
uration  is  taken  to  be  the  classification  rule. 
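The summary above can be sketched as a single loop. This is an illustrative Python rendering, not the C program used in the study: the names `anneal`, `energy`, and `perturb` are hypothetical, the caller is assumed to supply the energy and perturbation functions, and `max_levels` is a safety cap added here that is not part of the described procedure.

```python
import math
import random


def anneal(initial, energy, perturb, t1, r, length, rng: random.Random,
           max_levels=1000):
    """Basic annealing with a geometric cooling schedule of fixed length.

    energy(config) returns the value to minimize; perturb(config, rng)
    returns a new trial configuration and must not modify config.
    """
    current, current_e = initial, energy(initial)
    temperature = t1
    for _ in range(max_levels):              # safety cap (assumption)
        energy_changed = False
        for _ in range(length):              # L trials at this temperature
            trial = perturb(current, rng)
            trial_e = energy(trial)
            delta_e = trial_e - current_e
            # delta_e <= 0 is always accepted (exp(0) = 1 covers equality);
            # an uphill move is accepted with probability exp(-delta_e / T).
            if delta_e <= 0 or (temperature > 0 and
                                rng.random() < math.exp(-delta_e / temperature)):
                if delta_e != 0:
                    energy_changed = True    # accepted shift with a new E
                current, current_e = trial, trial_e
        if not energy_changed:               # frozen state: stop
            break
        temperature *= r                     # cool: T_{i+1} = r * T_i
    return current, current_e
```

A rejected trial leaves `current` untouched, so the next perturbation starts from the same tree, as the description requires.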

The  search  for  improved  classification  rules  was  pur¬ 
sued  by  applying  the  simulated  annealing  algorithm  to 
the trees formed from the waveform data. The previously
specified values of $T_1$ (0.000625, 0.00125, 0.0025,
and  0.005),  r  (0.25,  0.5,  0.75,  and  0.875),  L  (100,  150, 
and  200),  and  K  (5,  10,  20,  and  30)  were  used  in  the 
annealing  process.  Every  possible  combination  of  these 
parameter  values  was  tried,  making  a  total  of  192  com¬ 
binations  for  each  tree.  Also,  four  different  variations 
of  the  annealing  algorithm  (random  start/single  shift, 
CART start/single shift, random start/multiple shifts,
and  CART  start/multiple  shifts)  were  tried  with  each 
combination  of  parameter  values.  So,  in  all,  768  distinct 
ways  of  doing  the  annealing  were  tried  with  each  tree. 
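The counts quoted above can be checked by enumerating the design. The variable names are hypothetical; the parameter values and the eight trials per case are those given in the text.

```python
from itertools import product

T1_VALUES = (0.000625, 0.00125, 0.0025, 0.005)   # initial temperatures
R_VALUES = (0.25, 0.5, 0.75, 0.875)              # cooling ratios
L_VALUES = (100, 150, 200)                       # temperature lengths
K_VALUES = (5, 10, 20, 30)                       # maximum points moved per shift
STARTS = ("random", "CART")
SHIFTS = ("single", "multiple")
TRIALS_PER_CASE = 8                              # one per random number seed

parameter_sets = list(product(T1_VALUES, R_VALUES, L_VALUES, K_VALUES))
variations = list(product(STARTS, SHIFTS))
cases = list(product(parameter_sets, variations))

print(len(parameter_sets), len(cases), len(cases) * TRIALS_PER_CASE)
```

The three printed numbers are the 192 parameter combinations, the 768 distinct ways of annealing, and the 6144 annealing attempts per tree.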

The  annealing  algorithm  was  implemented  by  running 
C  programs  on  an  Intel  Hypercube  concurrent  computer. 
For each tree and each distinct case of the annealing
scheme, eight annealing trials were performed (and so
for each tree, a total of 6144 attempts were made to
minimize the resubstitution misclassification rate using
simulated annealing). Each of the eight trials used different
seeds for the pseudo random number generator, and
so the trials did not always result in the same tree configuration
at the frozen state. Each trial was performed on
a different node of the Hypercube, and the results from
each of the eight trials were then sent to a host program
where they were compared and summarized.

Results 

A  vast  amount  of  computer  time  was  required  in  order  to 
carry  out  the  experiment.  Some  jobs  in  which  6144  an¬ 
nealing  trials  were  performed  on  a  single  tree  took  nearly 
half  a  day  to  complete.  After  all  of  the  runs  were  made, 
the  results  from  this  experiment  were  closely  examined 
to  determine  which  combination  of  parameter  values  and 
which  of  the  four  variations  of  the  annealing  algorithm 
performed  best.  Also,  the  degree  of  improvement  was 
assessed  for  the  new  trees  produced. 

The  algorithm  performance  results  will  now  be  sum¬ 
marized  by  first  considering  each  of  the  four  variations 
of  the  annealing  algorithm  separately.  For  the  varia¬ 
tion  which  has  the  annealing  process  beginning  with  ran¬ 
domly  chosen  boundary  locations  and  allows  for  only  a 
single  boundary  to  be  shifted  with  each  random  per¬ 
turbation,  it  turned  out  that  for  each  tree  the  overall 
minimum  misclassification  rate  was  obtained  on  at  least 
one  trial  for  numerous  combinations  of  parameter  values. 

However,  the  following  set  of  parameter  values  seemed 
to work best overall: $T_1 = 0.005$, $r = 0.875$, $L = 150$,
and $K = 20$.

The  second  annealing  scheme  investigated  in  the 
search  for  optimal  algorithm  performance  also  used  the 
single  shift  approach,  but  instead  of  a  randomly  chosen 
initial  configuration,  the  CART  boundaries  were  taken 
to  be  the  starting  point  at  which  the  annealing  process 
began.  For  each  tree,  the  minimum  resubstitution  mis¬ 
classification  rate  obtained  was  exactly  the  same  as  it 
was  for  the  “random  start”  variation.  As  before,  nu¬ 
merous  combinations  of  parameter  values  resulted  in  the 
minimum  misclassification  rate,  but  with  this  variation 
the following values were found to be most favorable:
$T_1 = 0.00125$, $r = 0.875$, $L = 200$, and $K = 10$.

It  can  be  noted  that  the  best  value  for  r  is  the  same 
as it is for the “random start” scheme, but the optimizing
values for K, L, and $T_1$ are slightly different. With
the “CART start” variation, it seems to be better to use
smaller values for K and $T_1$. These smaller values will
lead  to  more  moderate  perturbations  of  the  boundaries, 
as  well  as  allow  for  fewer  uphill  moves.  The  result  is  a 
more  controlled  cooling,  for  which  the  energy  function 
decreases steadily towards a lower value. With the “random
start” method, for which the starting point could
be quite far from the optimal solution, a larger value
for $T_1$ prevented the process from being stopped at a local
minimum (which might be far from the overall minimum)
by producing greater probabilities of escape via
uphill  moves.  Also,  a  larger  value  for  K  allows  for  wilder 
movement  of  the  boundaries,  which  seems  appropriate  if 
the  starting  points  for  the  boundaries  can  be  far  from 
the  minimizing  locations. 

The  overall  performance  of  the  two  single  shift  varia¬ 
tions  differed  little.  Both  schemes  reached  the  same  min¬ 
imum  value  for  the  resubstitution  misclassification  rates, 
and  with  the  parameters  set  favorably,  both  schemes 
reached  the  minimum  value  with  high  frequency.  How¬ 
ever,  the  “CART  start”  variation  was  quicker  for  each 
tree  examined,  typically  reducing  the  average  number 
of  perturbations  required  to  reach  the  frozen  state  by  a 
factor  of  about  two  or  three. 

For  the  third  variation  of  the  annealing  scheme,  the 
single  shift  procedure  is  replaced  by  the  multiple  shifts 
procedure, and a randomly selected initial point is employed.
With regard to the best overall choice of parameters,
this case leads to the selection of $T_1 = 0.005$,
$r = 0.875$, $L = 200$, and $K = 10$. In all of the cases, the
minimum  value  found  was  the  same  as  it  is  for  the  other 
two  schemes. 

Comparing  the  set  of  parameters  which  worked  best 
for  this  random  start/multiple  shifts  scheme  with  the 
set  that  worked  best  for  the  random  start/single  shift 
scheme, it can be seen that $T_1$ and r are the same, but
the  values  for  L  and  K  differ.  When  only  a  single  shift 
is  made  for  each  perturbation,  the  best  choice  for  K 
is  20,  which  can  produce  rather  large  boundary  shifts. 
When  multiple  shifts  are  allowed,  a  smaller  K  works 
better.  This  observation  may  lead  one  to  the  tentative 
conclusion  that  the  overall  amount  of  shifting  should  not 
be  allowed  to  be  too  large. 

The final case is similar to the previous one in that it
involves multiple shifts of boundaries, but instead of the
randomly chosen starting boundaries, the CART tree is
used as a starting point. This final case resulted in the
same minimum misclassification rates and yielded
$T_1 = 0.0025$, $r = 0.75$, $L = 200$, and $K = 5$ as the best
combination  of  parameter  values.  Overall,  this  variation 
did  not  work  quite  as  well  as  the  other  three. 

The  lower  initial  temperature  and  the  smaller  value 
of  K  will  result  in  a  more  controlled  cooling  than  that 
which occurs with the larger parameter values of the random
start/multiple shifts case (where $T_1 = 0.005$ and
$K = 10$ worked best). When a comparison is made with
the CART start/single shift variation (for which the best
value for K is 10), it can be noted that the optimal value
of  K  is  smaller  in  this  CART  start/multiple  shifts  vari¬ 
ation.  This  observation  serves  to  reinforce  the  tentative 
conclusion  reached  earlier  concerning  the  desirability  of 
constraining  the  overall  size  of  the  maximum  configura¬ 
tion  shift. 

To  summarize,  numerous  ways  of  performing  the  sim¬ 
ulated  annealing  resulted  in  the  same  minimum  resubsti¬ 
tution  misclassification  rate  for  each  tree,  and  so  it  ap¬ 
pears  that  the  technique  is  rather  robust.  However,  with 
the  randomly  chosen  starting  points,  the  minimum  rate 
is  achieved  with  slightly  greater  frequency.  On  the  other 
hand,  with  the  CART  starting  points  it  generally  took 
considerably  less  time  to  reach  the  minimum  misclassifi¬ 
cation  rate.  Whether  the  “random  start”  or  the  “CART 
start”  method  was  utilized,  the  single  shift  method  per¬ 
formed  a  little  better. 

It  seems  that  for  all  four  variations  a  more  gradual 
cooling,  which  results  from  a  larger  value  of  r,  is  supe¬ 
rior  to  using  a  small  value  for  r  and  obtaining  a  quicker 
cooling.  Furthermore,  with  each  of  the  four  variations 
it was found that the performance deteriorated whenever
$T_1$ was made too small. A small value of $T_1$ decreases
the probability of uphill moves, and this resulted
in a greater likelihood of the solutions being trapped at
local minima. It can also be observed that when a
randomly selected starting point is used instead of the
CART solution starting point, $T_1$ and K should be chosen
to be larger in order to allow for a wilder shifting
of the boundaries before the cooling severely limits the
chance of escaping from local minima. Also, it can be
noted  that  the  multiple  shifts  variations  work  better  for 
smaller  values  of  K  than  those  which  are  preferred  with 
the  single  shift  variations.  This  leads  one  to  conclude 
that  it  is  best  to  limit  the  overall  amount  of  shifting  al¬ 
lowed.  This  notion  is  additionally  supported  by  the  fact 
that  none  of  the  four  cases  worked  best  when  K  was 
set  as  30.  To  conclude  these  remarks  concerning  the  pa¬ 
rameter  values,  it  should  be  stated  that  the  experiment 
performed  has  not  ruled  out  the  possibility  that  the  per¬ 
formance  could  be  improved  by  using  larger  values  for  r 
and  L.  Unfortunately,  larger  values  for  r  and  L  would 
require  longer  running  times  for  the  annealing  programs. 
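
The effect of the initial temperature and of the cooling rate can be illustrated with a small sketch. It assumes the usual Metropolis acceptance rule and a geometric cooling rule in which each temperature is r times the previous one; neither is restated in this section, so both are assumptions here, and the function names and numerical values are purely illustrative.

```python
import math

def uphill_acceptance_probability(delta, temperature):
    """Metropolis probability of accepting a move that raises the cost by delta > 0."""
    return math.exp(-delta / temperature)

def geometric_schedule(t_initial, r, n_stages):
    """Temperatures T1, r*T1, r^2*T1, ... (assumed cooling rule, 0 < r < 1)."""
    return [t_initial * r ** k for k in range(n_stages)]

# An uphill move that adds one misclassified case out of 300 (illustrative).
delta = 1 / 300
for t_initial in (0.01, 0.001):
    p = uphill_acceptance_probability(delta, t_initial)
    print(f"T1 = {t_initial}: P(accept uphill move) = {p:.3f}")

slow = geometric_schedule(0.01, 0.95, 20)  # larger r, more gradual cooling
fast = geometric_schedule(0.01, 0.80, 20)  # smaller r, quicker cooling
print(slow[-1], fast[-1])
```

Under these assumptions a smaller starting temperature makes the same uphill move much less likely to be accepted, and a larger r keeps the temperature high for more stages.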

The  simulated  annealing  process  produced  new  classi¬ 
fication  trees  by  randomly  shifting  partition  boundaries 
until  the  resubstitution  misclassification  rate  seemed  to 
be  lowered  as  much  as  possible.  In  each  of  the  thir¬ 
teen  cases  examined,  the  annealing  experiment  produced 
more  than  one  tree  having  the  lowest  misclassification 
rate.  In  order  to  assess  the  typical  amount  of  improve¬ 
ment  due  to  the  application  of  the  algorithm  to  CART 
produced  trees  for  this  setting,  it  was  decided  to  examine 
the performance of the tree which was most frequently
produced  by  the  annealing  process  in  each  case. 

For eleven of the thirteen cases considered, the
resubstitution misclassification rate for the classification tree
produced  by  the  annealing  algorithm  was  lower  than 
the  resubstitution  misclassification  rate  for  the  original 
CART  produced  tree,  and  for  the  other  two  cases  the 
resubstitution  misclassification  rate  was  the  same  as  it 
was  for  the  CART  rule.  However,  just  because  the  re¬ 
substitution  misclassification  rate  is  typically  reduced  by 
the annealed trees, it is not necessarily true that the
annealed trees are really more accurate. In order to assess
the  amount  of  improvement  produced  by  the  annealing 
procedure,  the  true  misclassification  rates  of  the  result¬ 
ing  trees  can  be  estimated  using  test  samples  consisting 
of  observations  which  are  independent  of  the  observa¬ 
tions  that  were  used  to  create  the  trees.  The  accuracy 
of  the  CART  produced  trees  can  be  assessed  using  the 
same  test  samples,  and  then  the  estimated  misclassifica¬ 
tion  rates  can  be  compared. 

Recalling  that  a  total  of  twenty  data  sets  of  300  obser¬ 
vations  each  were  originally  generated  in  the  same  way, 
it  is  clear  that  test  samples  of  independent  observations 
can  be  produced  in  the  following  manner.  To  assess  the 
accuracy  of  trees  produced  from  a  given  data  set,  the 
observations  in  all  of  the  other  nineteen  data  sets  can 
be combined to serve as a test sample. By doing this,
test  samples  for  each  tree  would  consist  of  5,700  obser¬ 
vations,  none  of  which  were  used  in  the  construction  of 
the  tree  being  evaluated. 

In ten of the thirteen cases, the unbiased estimate of
the true misclassification rate provided by the test sample
was lower for the tree which was sharpened by the
annealing process. However, the observed differences were
typically quite small, so it is natural to wonder whether
they indicate any real difference in the predictive ability
of the trees or can easily be attributed to chance. To
investigate this question, hypothesis tests were performed
(McNemar’s test was used) in order to make inferences about
the differences between dependent proportions. In one
of the thirteen cases, the performance of the annealed
tree is significantly better when tested at α = 0.01, and
in four of the thirteen cases, the performance of the
annealed tree is significantly better when tested at α = 0.1.
None of the thirteen annealed trees were found to be
significantly worse when α = 0.05 tests were done, and only
one was found to be significantly worse when α = 0.1
tests were performed.
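
A comparison of two trees on the same test sample can be sketched as follows. This is a minimal illustration of an exact two-sided McNemar test, not the code used in the experiment; the function names and the discordant counts in the usage lines are hypothetical.

```python
from math import comb

def discordant_counts(truth, pred_a, pred_b):
    """Count test cases misclassified by exactly one of the two trees."""
    b = sum(1 for y, pa, pb in zip(truth, pred_a, pred_b) if pa != y and pb == y)
    c = sum(1 for y, pa, pb in zip(truth, pred_a, pred_b) if pa == y and pb != y)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases misclassified by tree A only
    c: cases misclassified by tree B only
    Cases on which both trees are right or both are wrong do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts for one pair of trees.
print(mcnemar_exact(12, 25))
```

Only the discordant cases carry information about the difference between the two dependent proportions, which is why the 5,700 test observations reduce to the two counts b and c.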

Overall, it seems reasonable to conclude that the
application of the simulated annealing method is beneficial
for  the  waveform  data,  and  that  if  the  resubstitution  mis¬ 
classification  rate  is  decreased  then  the  true  misclassifica¬ 
tion  rate  may  be  slightly  decreased.  Of  course,  it  could 
be  argued  that  the  typical  amount  of  improvement  is 
somewhat  negligible  since,  on  the  average,  the  annealed 
trees  were  observed  to  do  only  a  little  better  than  the 
original  CART  trees.  However,  the  small  amounts  of 
improvement  could  be  largely  attributed  to  the  fact  that 
the  CART  trees  actually  do  a  pretty  good  job  with  this 
data,  and  there  simply  wasn’t  a  lot  of  room  for  improve¬ 
ment.  In  fact,  when  a  brute  force  search  for  the  lowest 
resubstitution  misclassification  rate  was  performed  for 
all of the three and four node trees (with the searches
limited to the class of trees having the same general
structure as the CART trees), in each case it was determined
that  the  annealing  process  reached  the  minimum  rate 
possible.  Furthermore,  it  was  found  that  the  minimum 
resubstitution  misclassification  rate  was  obtained  by  an¬ 
nealing  much  more  quickly  than  by  an  exhaustive  search 
—  for  the  four  node  trees  the  average  time  required  for  a 
brute  force  search  was  greater  than  the  time  required  for 
a set of eight annealing trials by a factor of about 650
(and  for  five  node  trees  the  difference  would  be  much 
larger).  All  in  all,  the  results  of  this  simulated  anneal¬ 
ing  experiment  can  be  taken  as  encouragement  that  the 
method  may  be  an  efficient  way  to  obtain  improved  tree 
structured  classifiers  in  situations  where  CART  leaves 
some  room  for  improvement. 
Abstract

A simple modification of the Classification and Regression
Tree (CART) algorithm of Breiman, Friedman, Olshen
and Stone (1984) that yields K-group stratifications
is  presented.  Such  stratifications  can  be  useful  for  de¬ 
scribing  patient  prognosis. 

1  Introduction 

Classification and regression trees have found applications
in many fields, including pattern recognition, artificial
intelligence and medicine. Trees have several advantages
compared to classical methods; they are completely
non-parametric  and  include  powerful  variable  subset  se¬ 
lection;  they  are  robust  to  outliers  in  the  covariate  space 
and  are  easily  used  on  a  wide  variety  of  data  structures. 
In  addition,  they  yield  results  that  can  be  expressed  as  a 
binary  decision  tree  that  allows  fast  prediction  and  that 
is  often  easily  interpreted.  For  applications  in  medicine 
it  is  the  decision  tree  representation  that  is  probably  the 
greatest  attraction  to  clinicians.  The  results  are  con¬ 
sistent  with  how  some  medical  researchers  think  about 
certain  problems,  which  can  lead  to  easier  interpretation 
and  communication  of  statistical  results. 

Tree-based  regression  models  are  constructed  by  re¬ 
cursively  partitioning  the  data  and  the  covariate  space 
into  groups  that  minimize  some  measure  of  impurity,  for 
instance  residual  sum  of  squares  for  continuous  response 
data  or  binomial  deviance  for  binary  response  data.  The 
partitioning  typically  continues  until  there  are  only  a  few 
observations  in  each  group  and  the  binary  tree  represent¬ 
ing  the  partitioning  is  large;  this  is  done  to  avoid  missing 
structure. 

While tree-based methods have been available since
Morgan and Sonquist (1963), advances in the methodology,
including not limiting the tree growth and using
an optimal pruning algorithm with cross-validated estimates
of prediction error to choose the size of the tree,
were introduced in the Classification and Regression Tree
(CART) algorithm of Breiman, Friedman, Olshen and
Stone (1984) (BFOS).

After  choosing  a  tree  of  about  the  right  size  there  can 
be  a  simplification  of  the  description  by  further  combin¬ 
ing  nodes  that  are  close  in  terms  of  response  and/or  in 
the  covariate  space.  These  nodes  need  not  be  adjacent 
in the presented tree structure. In terms of studying a
patient’s outcome, this recombination of many possible
terminal nodes could lead to a few descriptive classes,
say “good prognosis”, “fair prognosis”, and “poor prognosis.”
Such prognostic stratifications can also be useful in
the development of staging schemes that can be used
in the development of new clinical trials. The problem
of  development  of  prognostic  stratification  rules  was  the 
motivation  for  the  technique  presented  here. 

Below I outline a variation of the CART regression
algorithm that recombines possibly non-adjacent nodes
to yield a tree-based stratification:

1. A tree is constructed and cost-complexity pruning
is used, as in the CART algorithm, to find the sequence
of optimally pruned subtrees for any penalty α.

2.  For  each  optimally  pruned  subtree  a  locally  opti¬ 
mal  2, 3, 4, 5, ...  group  recombination  of  the  nodes  is 
found  by  a  K-means  type  clustering  algorithm. 

3.  The  whole  process  is  cross-validated  as  in  CART. 
Therefore,  the  choice  of  number  of  strata  can  also 
be  based  on  an  estimate  of  prediction  that  is  not 
overly  optimistic. 

The  use  of  K-means  like  clustering  to  construct  lo¬ 
cally  optimal  K-ary  splits  of  nominal  covariates  was  in¬ 
vestigated  by  Chou  (1989).  In  addition,  he  proposes 
the  development  of  “compound  nodes”  by  using  the  ter¬ 
minal  nodes  of  a  tree  to  define  a  class  variable.  The 
algorithm  presented  here  implements  such  a  clustering 
scheme  among  all  optimally  pruned  subtrees  with  the 
goal  of  finding  good  tree-based  stratifications.  Another 
tree-based stratification technique was implemented by
Ciampi et al. (1988) for survival data.
2  Growing  Trees

The data are assumed to consist of a vector of observations
$(y_i, x_i)$, $i = 1, \ldots, N$, observed from $(Y, X)$, where
$Y$ is the response and $X$ is a vector of covariates
$X = (X_1, X_2, \ldots, X_p)$.

Tree  growing  procedures  recursively  split  the  data  and 
the  covariate  space  into  two  groups.  Splits  are  chosen 
based  on  the  reduction  in  the  impurity  of  a  node  or  on 
a  measure  of  dissimilarity  in  response  between  nodes. 
Define impurity at a node as the expected loss

$$i(t) = E[L(Y, \mu(t)) \mid t],$$

where $\mu(t)$ minimizes the loss for node $t$. Let the
expected cost of a node $t$ be

$$R^*(t) = i(t) P(t),$$

where $P(t)$ is the probability of falling into node $t$; an
estimate of $R^*(t)$ is

$$R(t) = \int_{X \in B_t} L(Y, \mu(t)) \, dF_N,$$

where $B_t$ is the region corresponding to node $t$ and $F_N$ is
the empirical distribution function. Frequently the loss
functions $L(Y, \mu) = (Y - \mu)^2$ and
$L(Y, \mu) = -Y \log(\mu) - (1 - Y) \log(1 - \mu)$ are used for
continuous and binomial data respectively. Here, I follow
Clark and Pregibon (1991) and Ciampi et al. (1987) in the
use of the likelihood function for tree-based models.
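
As a rough illustration of the estimated node cost, the sketch below evaluates the empirical version of the integral, that is, the losses of the observations falling in node t summed and divided by the full sample size N. The function names are hypothetical, and the node mean is used for $\mu(t)$ because it minimizes both losses given above.

```python
import math

def squared_error_loss(y, mu):
    """Loss for a continuous response."""
    return (y - mu) ** 2

def binomial_loss(y, mu):
    """Negative Bernoulli log-likelihood; requires 0 < mu < 1."""
    return -y * math.log(mu) - (1 - y) * math.log(1 - mu)

def node_cost(y_node, n_total, loss):
    """Estimate R(t) from the responses in node t and the full sample size N."""
    mu = sum(y_node) / len(y_node)  # node mean, the minimizer of either loss
    return sum(loss(y, mu) for y in y_node) / n_total

# A node holding 3 of N = 10 continuous responses.
print(node_cost([1.0, 2.0, 3.0], 10, squared_error_loss))
```

With the binomial loss a node containing only successes or only failures has a mean of exactly 0 or 1, so a practical implementation would need to guard the logarithms; that guard is left out of this sketch.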

Tree  growing  procedures  calculate  the  reduction  in  im¬ 
purity  at  all  possible  splits  and  choose  a  split  that  maxi¬ 
mizes  the  reduction.  The  splitting  continues  until  a  large 
tree  has  been  grown  with  only  a  few  observations  in  each 
node. The entire partitioning process is usually represented
by a tree $T$. Let $\tilde{T}$ denote the set of terminal
nodes of tree $T$.

A tree-based regression model can be expressed in
terms of a partition function $t(x) = t$ if $x \in B_t$, where
$B_t$ corresponds to a terminal region, and a decision rule
$\nu(t) = \theta_t$, where $\theta_t$ is an estimate corresponding to
that terminal node. Alternatively, the model can be
expressed as a step-function regression function

$$\mu(x) = \sum_{t \in \tilde{T}} \theta_t \, I(x \in B_t).$$

In the CART algorithm, the cost-complexity measure

$$R_\alpha(T) = \sum_{t \in \tilde{T}} R(t) + \alpha |\tilde{T}|,$$

where $\alpha$ is a non-negative complexity parameter and $R(t)$
is the estimated cost of node $t$ defined above, is used to
assess the performance of a tree-based model.


An optimally pruned subtree for any penalty $\alpha$ of the
tree initially grown is $T_1$ if

$$R_\alpha(T_1) = \min_{T' \preceq T} R_\alpha(T'),$$

where “$\preceq$” means “is a subtree of”, and it is the smallest
optimally pruned subtree if $T_1 \preceq T''$ for every optimally
pruned subtree $T''$. Let $T(\alpha)$ denote the smallest
optimally pruned subtree of $T$ for complexity parameter
$\alpha$.

There is an efficient algorithm for obtaining $T(\alpha)$ for
any $\alpha$, called the cost-complexity pruning algorithm.
It consists of finding the sequence of optimally pruned
subtrees by repeatedly removing branches for which the
average reduction in impurity per split in the tree is
small. The process yields a nested sequence of subtrees
$T_m \prec \cdots \prec T_l \prec \cdots \prec T_1 \prec T_0$, where $T_m$ is the root
node, and the sequence of thresholds
$\infty > \alpha_m > \cdots > \alpha_l > \alpha_{l-1} > \cdots > \alpha_2 > \alpha_1 > 0$, such that
the optimally pruned subtree is $T(\alpha) = T(\alpha_l) = T_l$ for
$\alpha_l \le \alpha < \alpha_{l+1}$ (BFOS).
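
The pruning sequence can be sketched with the weakest-link rule described in BFOS. The `Node` class and function names below are hypothetical, each node is assumed to store its estimated cost R(t), and every split is assumed to reduce the cost strictly; this is an illustration, not the implementation used for the examples in this paper.

```python
class Node:
    def __init__(self, cost, left=None, right=None):
        self.cost = cost      # R(t) if this node were made terminal
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None

def subtree_stats(node):
    """Return (sum of R over terminal nodes below node, number of terminal nodes)."""
    if node.is_leaf():
        return node.cost, 1
    left_cost, left_n = subtree_stats(node.left)
    right_cost, right_n = subtree_stats(node.right)
    return left_cost + right_cost, left_n + right_n

def internal_nodes(node):
    """Internal nodes in pre-order (parents before their descendants)."""
    if node.is_leaf():
        return []
    return [node] + internal_nodes(node.left) + internal_nodes(node.right)

def link_strength(node):
    """Increase in cost per terminal node removed if the branch at node is cut."""
    cost, leaves = subtree_stats(node)
    return (node.cost - cost) / (leaves - 1)

def prune_sequence(root):
    """Collapse the weakest links in turn; return [(alpha, terminal nodes), ...]."""
    sequence = [(0.0, subtree_stats(root)[1])]
    while not root.is_leaf():
        candidates = internal_nodes(root)
        alpha = min(link_strength(t) for t in candidates)
        for t in candidates:
            if not t.is_leaf() and link_strength(t) <= alpha + 1e-12:
                t.left = t.right = None   # make t a terminal node
        sequence.append((alpha, subtree_stats(root)[1]))
    return sequence

tree = Node(10, Node(4, Node(1), Node(2)), Node(3))
print(prune_sequence(tree))
```

Each entry pairs a threshold with the size of the subtree that is optimal from that threshold up to the next one, so the thresholds increase while the subtrees shrink toward the root.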

3  K-ary  Stratification 

A tree-based K-ary stratification will be defined to be a
special case of the tree-based model described in Section
2. That is, we have a partition function $t(x) = t$ as before,
but now there is also the constraint that the decision rule
$\nu(\cdot)$ must have only $K$ values, where $K$ is smaller than
the number of terminal nodes. The tree-based regression
model is then reduced to a piecewise constant model
with only $K$ different prediction values.

One strategy is to find the best K-ary stratification
among all subtrees of tree $T$. This scheme would be very
computationally demanding because of the extremely
large number of possible subtrees of even a moderate
size tree. The proposed algorithm restricts the search to
finding locally optimal recombinations of optimal trees
obtained from the cost-complexity pruning algorithm. I
denote the K-ary stratification of the optimally pruned
subtree for parameter $\alpha$ by $S_K(T(\alpha))$. Note that some
stratification of a sub-optimal tree may perform better.

Chou (1989) showed that a necessary condition for
any K-ary partition $A_0, \ldots, A_{K-1}$ of
$\tilde{T} = \{t_1, \ldots, t_{|\tilde{T}|}\}$ to minimize the average impurity

$$I = \sum_{k=0}^{K-1} i(s_k) P(s_k)$$

is that $t \in A_k$ only if $k = \arg\min_{k'} d(t, \mu(s_{k'}))$ or if
$P(t) = 0$, where $s_k = \{t \in A_k\}$ and where $d$ is the divergence

$$d(t, \mu) = E[L(Y, \mu) \mid t] - E[L(Y, \mu(t)) \mid t],$$

which measures the increase in expected loss when $\mu$
is used to represent $Y$ instead of $\mu(t)$. This in general
reduces the problem of finding an optimal K-ary partition
to polynomial time in $N$, O(N^[…]). However, even with a
moderate number of terminal nodes a fast approximate
algorithm is useful. A K-means like clustering will be
used as in Chou (1989); the algorithm will always converge,
but may only lead to a local optimum.

Below is an outline of the K-means algorithm that
is applied to each optimally pruned subtree, $T_l$;
subscripts indicating validation sample and pruned subtree
are omitted to simplify the description.

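1. Pick some initial partition $S^0 = (A_0^0, \ldots, A_{K-1}^0)$. For
example, order the “means” for each node and divide
them up into $K$ groups of approximately equal
size.

2. Calculate the centroids $\mu_0^j, \ldots, \mu_{K-1}^j$, where

$$\mu_k^j = \arg\min_{\mu} \sum_{t \in A_k^j} d(t, \mu) P(t).$$

3. Update the partition, $S^{j+1} = (A_0^{j+1}, \ldots, A_{K-1}^{j+1})$: let
$t \in A_k^{j+1}$ if

$$k = \arg\min_{k'} d(t, \mu_{k'}^j).$$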

Steps 2 and 3 are repeated until convergence of the
partition. The algorithm yields a sequence of locally
optimal K-ary stratifications $S_K(T(\alpha))$ for complexity
parameter $\alpha_l \le \alpha < \alpha_{l+1}$, $l \ge 1$.
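
The three steps can be sketched for the squared-error loss, where the divergence between a node and a candidate centroid reduces to the squared difference between the node mean and the centroid, and the centroid of a group is the weighted mean of its node means. The function name and arguments below are hypothetical, and other loss functions would need their own divergence and centroid.

```python
def recombine_nodes(means, weights, k, max_iter=100):
    """K-means-type grouping of terminal nodes under squared-error loss.

    means[t]   : mean response in terminal node t, playing the role of mu(t)
    weights[t] : positive share of observations in node t, playing the role of P(t)
    Returns (labels, centroids); labels[t] is the stratum assigned to node t.
    """
    n = len(means)
    # Step 1: order the nodes by mean and cut into k groups of roughly equal size.
    order = sorted(range(n), key=lambda t: means[t])
    labels = [0] * n
    for rank, t in enumerate(order):
        labels[t] = rank * k // n
    centroids = [0.0] * k
    for _ in range(max_iter):
        # Step 2: centroid of each group = weighted mean of its node means.
        for g in range(k):
            members = [t for t in range(n) if labels[t] == g]
            if members:  # an emptied group keeps its previous centroid
                total = sum(weights[t] for t in members)
                centroids[g] = sum(weights[t] * means[t] for t in members) / total
        # Step 3: move each node to the group with the nearest centroid.
        new_labels = [min(range(k), key=lambda g: (means[t] - centroids[g]) ** 2)
                      for t in range(n)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

labels, centroids = recombine_nodes([0.0, 0.1, 0.2, 10.0], [1, 1, 1, 1], 2)
print(labels, centroids)
```

In this toy case the equal-size starting partition splits the four nodes two and two, and one update moves the third node into the low-response group. Because nodes are grouped by response rather than by position in the tree, the members of a stratum need not be adjacent.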

4  Examples 

In this section I explore the stratification option and
compare it to unstratified tree-based models on two
simulated data sets. In both cases 10-fold cross-validation
was used to calculate estimates of prediction error. The
algorithms were implemented by modifying the tree-based
tools of Clark and Pregibon (1991) in the S
programming language.

4.1  Simple  Regression 

The regression function $f(x) = 2 + 2x_1 + 2x_2 + 2x_3$ was
used for this example. The response values were generated
as

$$y_i = f(x_i) + \epsilon_i,$$

where the $(x_1, x_2, x_3)$ were generated from the Uniform
(0, 1) distribution and the $\epsilon_i$ were generated from a standard
normal distribution. The sample size was $N = 250$.
Figure 1 summarizes the estimated relative prediction errors
for the unstratified tree and 2, 3, 4 and 5 group
stratifications for each optimally pruned subtree. Relative
prediction error is the ratio of the prediction error to the
null model prediction error. While it is clear in this
situation that the 2, 3 and 4 group stratifications yield
increased prediction error, the 5 group stratification yields
estimated prediction errors almost as small as the
unstratified regression. However, in this example the
stratification does not yield much simplification since the
unstratified tree that minimizes the cross-validated estimate
of prediction error has 7 terminal nodes.
Figure 1: Example 1 - Cross-validated estimates of
relative prediction error for the unstratified tree and
2, 3, 4 and 5 group stratifications.

4.2  Binary  Response 

Five hundred observations were generated from the
model

$$\mathrm{Binomial}\bigl(n = 1,\; p(x_i) = \exp(f(x_i)) / (1 + \exp(f(x_i)))\bigr),$$

where the function

$$f(x) = -2\,\mathrm{sign}\{x_1 > .5\} + 2\,\mathrm{sign}\{x_2 > .5\} - 2\,\mathrm{sign}\{x_3 > .5\} + 2\,\mathrm{sign}\{x_4 > .5\}.$$

The $(x_1, x_2, x_3, x_4)$ were generated from the Uniform
(0, 1) distribution. In this model the conditional
probability of success given $x$ has only five different
values. However, the best fitting unstratified tree-based
model has 13 nodes (with sufficient data one would
expect a tree with 16 nodes, each node corresponding to
an area of constant success probability). Figure 2 shows
that either the 3 or 4 group stratifications at tree sizes of
12 and 13 nodes perform similarly to (with slightly smaller
estimated prediction error than) the unstratified tree with 12
terminal nodes. The 3-group stratification is presented
in Figure 3. Note that Figure 2 also shows that for trees
much larger than the optimal size the 2 and 3 group
stratifications perform better than the full unstratified
trees.
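
The structure of this model can be checked with a short sketch. It assumes that sign{x > .5} is read as +1 when x exceeds 0.5 and -1 otherwise; that reading and the helper names are assumptions made for illustration.

```python
import math
from itertools import product

def sgn(x):
    """Assumed reading of sign{x > .5}: +1 if x > 0.5, otherwise -1."""
    return 1 if x > 0.5 else -1

def f(x1, x2, x3, x4):
    return -2 * sgn(x1) + 2 * sgn(x2) - 2 * sgn(x3) + 2 * sgn(x4)

def success_probability(x):
    """p(x) = exp(f(x)) / (1 + exp(f(x))), written in an equivalent form."""
    return 1 / (1 + math.exp(-f(*x)))

# One representative point per half-interval of each covariate: 2**4 = 16 regions.
regions = list(product((0.25, 0.75), repeat=4))
distinct_f = sorted({f(*x) for x in regions})
print(len(regions), distinct_f)
```

Under this reading the sixteen regions share only five distinct values of f, and hence five distinct success probabilities, which is the situation in which recombining terminal nodes into a few strata loses little.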

The technique has also been applied to data on prognosis
after heart attacks. The data set analyzed included 1780
subjects collected by the Specialized Center for Research
on Ischemic Heart Disease at the University of California,
San Diego. A subset of this data set was analyzed
in BFOS. The unstratified tree that minimizes the
cross-validated estimate of weighted deviance had eight
terminal nodes with a relative deviance of .883. However, the
stratified tree with only 3 prognostic strata has a similar
relative deviance of .889. The analysis is presented in
LeBlanc (1991, unpublished manuscript).


Figure 2: Example 2 - Cross-validated estimates of
relative deviance for the unstratified tree and 2, 3, 4, 5
group stratifications.

The  proposed  procedure  uses  the  K-means  algorithm 
for  all  pruned  subtrees;  another  possibility  would  be  to 
use  a  combined  approach.  For  small  trees  the  optimal  K- 
ary  stratification  could  be  found  by  the  optimal  partition 
theorem  and  for  larger  trees  a  locally  optimal  stratifica¬ 
tion  could  be  found  by  the  K-means  algorithm. 


Figure 3: Example 3 - 3-ary stratification tree. The
number of observations and success probability for the
unstratified tree are given below each node.
Abstract

A  long  standing  problem  in  information  retrieval  is  how  to 
treat  queries  that  are  best  answered  by  two  or  more  distinct 
sets  of  documents.  Existing  methods  average  across  the 
words  or  terms  in  a  user’s  query,  and  consequently, 
perform  poorly  with  multimodal  queries,  such  as:  "Show 
me  documents  about  French  art  and  American  jazz".  We 
propose  a  new  method,  the  Relevance  Density  Method  for 
selecting  documents  relevant  to  a  user’s  query.  The 
method  can  be  used  whenever  the  documents  and  the 
terms  are  represented  by  vectors  in  a  multi-dimensional 
space,  such  that  the  vectors  corresponding  to  documents 
and  terms  dealing  with  closely  related  topics  are  close  to 
each  other.  We  show  that  the  Relevance  Density  Method 
performs  better  for  multimodal  as  well  as  single  mode 
queries  than  an  averaging  method.  In  addition,  we  show 
that  retrieval  is  substantially  faster  for  the  new  method. 

Introduction 

The  task  of  an  information  retrieval  system  is  to  respond  to 
a user’s request for information (a query) by searching a
collection of documents (e.g., texts such as books, journal
articles, etc.) and selecting those documents that seem to be
relevant  to  the  topic(s)  of  the  query.  Usually,  the 
documents  in  the  collection  are  indexed  by  terms 
(keywords).  It  is  assumed  that  the  topic(s)  of  a  document 
or  of  a  query  is  adequately  reflected  by  its  collection  of 
terms. 

The  relevance  density  method  proposed  in  this  paper  can 
be  applied  whenever  terms  and  documents  are  represented 
by  vectors  in  the  same  multidimensional  document-term 
space  with  similarity  of  terms  and  documents  reflected  by 
the  closeness  of  their  vector  representations  in  that  space. 
In  other  words,  if  two  vectors  are  close  together,  then  the 
corresponding  terms  or  documents  can  be  assumed  to  be 
closely  related  in  their  topics  and  vice  versa.  Methods  for 
constructing  such  a  space  are  presented  in  [1]  and  [2]. 

Currently,  the  method  of  selecting  relevant  documents 
used  in  conjunction  with  such  vector  representations  of 
terms  and  documents  is  called  vector  averaging  (VA). 
VA  [2],  [3]  represents  a  query  by  a  single  vector  in  the 


document-term  space.  This  query  vector  is  a  weighted 
average  of  the  term  vectors  used  in  the  query.  Documents 
in  the  collection  are  ranked  by  the  closeness  (measured  by 
the  cosine  or  dot  product)  of  their  vectors  to  the  query 
vector.  The  top  ranking  documents  are  selected  as 
relevant  and  returned  to  the  user. 

Representing  the  query  by  a  single  vector  works  well 
when  the  vectors  of  the  relevant  objects  (documents  and 
terms)  are  clustered  together  in  a  single  region  of  the 
document-term  space,  since  the  center  of  that  region  is  a 
reasonable  estimate  of  the  query’s  content.  However,  if 
the  vectors  of  the  relevant  objects  fall  into  two  or  more 
clusters separated by regions of the space containing
non-relevant documents, then averaging will perform poorly,
since  it  will  tend  to  retrieve  documents  between  the  two 
clusters  of  relevant  documents.  One  proposed  solution  [4] 
was  to  identify  multimodal  queries  and  split  them  into 
sub-queries.  However,  this  method  was  too 
computationally  expensive  and  has  not  been  used  widely. 
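
The behavior of vector averaging on a two-topic query can be seen in a toy sketch. The two-dimensional vectors, the equal term weights, and the function names are illustrative assumptions and are not taken from the ADVISOR system.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def query_vector(term_vectors, weights):
    """Weighted average of the query's term vectors."""
    total = sum(weights)
    dim = len(term_vectors[0])
    return [sum(w * t[i] for w, t in zip(weights, term_vectors)) / total
            for i in range(dim)]

def rank_by_vector_averaging(term_vectors, weights, documents):
    """Document names ordered from most to least similar to the query vector."""
    q = query_vector(term_vectors, weights)
    return sorted(documents, key=lambda name: cosine(q, documents[name]), reverse=True)

# Two query terms pointing in very different directions (two topics).
terms = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "near_term_1": [0.98, 0.2],
    "near_term_2": [0.2, 0.98],
    "in_between": [0.7, 0.7],
}
print(rank_by_vector_averaging(terms, [1.0, 1.0], docs))
```

The averaged query vector points midway between the two terms, so the document lying between the clusters is ranked ahead of the documents that are close to either term.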

An  additional  drawback  of  vector  averaging  is 
computational  expense.  Typically,  the  query  vector  is 
compared  to  every  document  vector.  If  the  document 
collection  is  large  and  the  dimensionality  of  the 
document-term  space  high,  computational  demands  can  be 
quite significant. The proposed method can be
implemented  using  table  look-up,  thereby  trading  space  for 
time. 

The  Relevance  Density  Method 

We  propose  a  new  method  of  ranking  documents.  The 
Relevance  Density  Method  (RDM)  can  be  used  whenever 
documents  and  terms  are  represented  by  vectors  in  the 
document-term  space. 

We treat relevance as a continuous quantity and model its
distribution by a probability density $\pi(D)$ over the
document-term space. The documents in the collection are
ranked in the order of the height of the density over their
vector representations $D$. In other words, the document
that has the highest value of $\pi(D)$ is given rank 1, etc.,
with higher ranks reflecting greater similarity or relevance.
Thus, this density should be high over areas of the
document-term space containing vectors of relevant
objects and low over areas of nonrelevant objects. If there
is more than one cluster of relevant objects, then the
density should be multimodal.

To construct the density $\pi(D)$ we will start with a prior
density $\pi_0(D)$ which reflects the system’s a priori guess
about the user’s interests. If no prior information about
the user is available, $\pi_0(D)$ is a constant and does not
affect the ranking. We use Bayes’ rule to update the
density when the user’s query is received. As in vector
averaging, the query is treated as a collection of terms
used in the query. Let $Q = \{T_1, \ldots, T_k\}$ be the set of
vectors corresponding to the terms used in the query,
where $k$ is the number of terms used in the query. Then$^1$

$$\pi_1(D \mid Q) = f(Q \mid D)\, \pi_0(D)$$

In some cases, relevance feedback can be obtained from
the user after the initial query. The user is presented with a
few top ranking documents and asked which of them s/he
considers relevant to her/his query. If such relevance
feedback is available, it can be used to update $\pi(D)$. Let
$Q_1 = \{D^1, \ldots, D^m\}$ be the set of vectors corresponding
to the documents that the user considered relevant, where
$m$ is the number of documents the user considered
relevant. Then the relevance density after the feedback is:

$$\pi_2(D \mid Q, Q_1) = f(Q_1 \mid D)\, \pi_1(D \mid Q)$$


We used:

$$f(Q \mid D) = \sum_{j=1}^{k} w_j\, c(b_j) \exp[b_j \cos(T_j, D)]$$


where $\cos(T_j, D)$ is the cosine of the angle between the
term vector $T_j$ and document vector $D$. The above density
has the property of being unimodal when the term vectors
are in a single cluster and multimodal when there is more
than one cluster. This density is a sum of bell-shaped
components. The $j$th bell is centered over the vector of the
$j$th term used in the query. The bell is tall and narrow if
the parameter of concentration, $b_j$, is high, and low and
wide if $b_j$ is low. The parameter of concentration
differentiates highly specific terms from broad, less
specific terms. For example, single word terms, such as
cable, tend to be less specific than multi-word terms, such
as fiberoptic cable [3], [6]. The factor $c(b_j)$ normalizes the
$j$th bell to integrate to 1, making it a proper density. The
weights $w_j$ can be used to express different amounts of
importance associated with terms. For example, words


1. To make $\pi_1$ a proper density, a scaling constant is needed, but since it
does not affect the ranking, we will omit it.


can be weighted according to their information value;
common or frequent words are weighted less heavily than
rare words. (A list of desirable qualities of a sampling
density, proofs that $f(Q \mid D)$ has these qualities, and an
alternative sampling function are presented in [5].)

The values of $w_j\, c(b_j) \exp[b_j \cos(T_j, D)]$ can be
precalculated for every document and term and stored. Thus,
when a user’s query is processed, the system simply looks
up the values corresponding to the terms used in the query
and adds them up to compute $f(Q \mid D)$. This table look-up
method of computation makes the RDM far less
computationally expensive than the VA in terms of the
number of operations required [5]. However, if the term
by document matrix is large, having enough space to store
the values becomes an issue.
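
The table look-up can be sketched as follows. The sketch assumes unit vectors in three dimensions so that the normalizing factor has a simple closed form; the dimension, the concentration value of 5, the toy vectors, and the function names are all illustrative assumptions, and a real document-term space would need the normalizer for its own dimension.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalizer_3d(b):
    """c(b) such that c(b) * exp(b * cos) integrates to 1 over the unit sphere in 3-D."""
    return b / (4 * math.pi * math.sinh(b))

def build_table(term_vectors, doc_vectors, b, w):
    """table[j][d] = w_j * c(b_j) * exp(b_j * cos(T_j, D_d)), computed once and stored."""
    return [[w[j] * normalizer_3d(b[j]) * math.exp(b[j] * cosine(t, d))
             for d in doc_vectors]
            for j, t in enumerate(term_vectors)]

def relevance_density(table, query_terms):
    """f(Q | D) for every document: add the stored rows of the query's terms."""
    n_docs = len(table[0])
    return [sum(table[j][d] for j in query_terms) for d in range(n_docs)]

# Two query terms on different topics; documents 0 and 1 sit on the terms,
# document 2 lies midway between them.
terms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
table = build_table(terms, docs, b=[5.0, 5.0], w=[1.0, 1.0])
density = relevance_density(table, query_terms=[0, 1])
ranking = sorted(range(len(docs)), key=lambda d: density[d], reverse=True)
print(ranking)
```

With a concentration this high the summed density has a separate peak over each term, so the two on-topic documents are ranked ahead of the one in between. At query time only look-ups and additions are needed, at the cost of storing one value per term and document pair.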

Results  of  Testing 

Both  the  RDM  and  VA  methods  were  tested  on  Bellcore’s 
ADVISOR  system  [3],  [6].  The  system  responds  to  a 
query  by  identifying  departments  within  Bellcore  best 
suited  to  answer  the  query.  (Bellcore  is  a  large  and 
diverse  research  and  development  company.)  At  the  time 
of  the  first  set  of  tests,  the  104  departments  were 
represented  by  abstracts  of  the  technical  papers  they 
produced  in  1987.  There  were  728  such  documents 
indexed by 7,100 terms in the ADVISOR’s collection.
New  abstracts  were  collected  in  1987  and  in  1989  and 
used  as  test  queries.  (We  did  not  use  as  queries  any  of  the 
abstracts in ADVISOR’s collection.) In addition, to study
the  performance  in  cases  where  the  query  was  likely  to 
have  at  least  two  separate  topics,  we  constructed  "double" 
queries  by  joining  the  texts  of  pairs  of  abstracts  produced 
by  two  different  departments  and  treating  these  joined 
texts  as  a  single  query. 

The  measure  of  performance  for  each  test  query  was  the 
rank  of  the  first  retrieved  "relevant"  document.  A 
document  was  considered  relevant  to  the  query  if  it  was 
produced  by  the  same  department  as  the  one  that  produced 
the  query.  In  the  case  of  the  double  queries,  the 
documents  produced  by  either  one  of  the  two  departments 
were  considered  relevant.  If  the  method  of  retrieval  were 
perfect,  the  rank  of  the  first  correct  document  would  be  1. 
On the other hand, if the documents were ranked
randomly,  the  rank  would  be  on  average  52. 

Each query was ranked by each of the two methods, RDM
and  VA.  VA  was  used  with  a  root  mean  squared 
weighting  of  the  terms  and  with  the  cosine  as  the  similarity 
measure.  This  weighting  scheme  and  similarity  measure 
were  chosen  because  they  produced  the  best  performance 
in  previous  tests  on  this  collection.  The  RDM  was  used 
with a constant prior density (i.e. no prior information),
with constant weights on the terms and with $b_j = 1$ for
terms consisting of a single word and $b_j = 2$ for multi-word
terms.
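
As a rough illustration of the VA side of this comparison, the sketch below averages the vectors of the query terms and ranks documents by cosine similarity to that average. It is a simplified sketch only: the term weighting mentioned above is omitted, the vectors are tiny made-up examples, and all function names are hypothetical rather than taken from ADVISOR.

```python
import math


def average_vector(vectors):
    """Component-wise mean of equal-length vectors (the query representation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_term_vectors, doc_vectors):
    """Return document names ordered from most to least similar to the query."""
    query = average_vector(query_term_vectors)
    scored = [(cosine(query, vec), name) for name, vec in doc_vectors.items()]
    return [name for score, name in sorted(scored, key=lambda pair: -pair[0])]


# Two query terms and three documents in a made-up 2-dimensional space.
terms = [[1.0, 0.0], [1.0, 1.0]]
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 1.0]}
order = rank_documents(terms, docs)
```

Averaging collapses a query with two distinct topics into a single direction, which is the situation the "double" queries were built to probe.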

The  results  of  these  tests  are  presented  in  Table  1.  We 
observe  that  for  263  new  abstracts  produced  in  1987, 
which  were  used  as  queries,  both  VA  and  RDM  answered 
at  least  25%  of  these  263  queries  correctly  on  the  first  try, 
since  the  lower  quartile  of  the  ranks  of  the  first  correct 
documents  is  1  for  both  methods.  VA  answered  at  least 
50%  of  the  queries  on  or  before  the  third  try  (median 
rank=3),  while  RDM  did  better  with  a  median  rank  of  2. 
Finally,  the  upper  quartile  of  the  ranks  was  19  for  VA  and 
9  for  RDM.  This  indicates  that  RDM  answered  75%  of 
the  queries  correctly  on  or  before  the  9th  try,  whereas  VA 
answered  75%  of  the  queries  correctly  on  or  before  the 
19th  try.  From  the  user’s  point  of  view  there  is  likely  to  be 
a  big  difference  between  looking  at  8  versus  18  non- 
relevant  documents  before  getting  a  relevant  one.  The 
statistical  significance  of  the  differences  in  performance 
was  assessed  using  a  Wilcoxon  Signed  Rank  test.  The 
value  of  the  z  statistic  for  the  263  queries  was  -2.14.  The 
p  value  of  the  test  against  the  two-sided  hypothesis  is 
0.016. 
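
For readers unfamiliar with the test, the following sketch computes a signed-rank z statistic from paired ranks using the usual large-sample normal approximation. It assumes zero differences are dropped and that no tie or continuity correction is applied to the variance; the paper does not say which conventions were used, so this is not claimed to reproduce the values reported here. The example data are invented.

```python
import math


def wilcoxon_z(x, y):
    """Normal-approximation z for the Wilcoxon signed-rank test on pairs."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1                # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd


# Invented ranks of the first relevant document for six queries.
rdm_ranks = [1, 2, 1, 4, 3, 1]
va_ranks = [3, 5, 1, 9, 2, 6]
z = wilcoxon_z(rdm_ranks, va_ranks)
```

A negative z in this orientation means the first argument tends to have the smaller (better) ranks.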

Similar comparisons can be made for the two ranking
methods based on queries from 1989 and on the "double"
queries from 1987 and 1989. Both methods performed
better on the 1987 queries. This is to be expected, since
the departments in ADVISOR's database are represented
by 1987 documents, and undoubtedly the
departments' emphasis and work have shifted in two
years.

The  overall  conclusion  that  can  be  drawn  from  the  data  in 
Table  1  is  that  the  RDM  performed  better  than  the  VA 
(had  the  rank  of  the  first  relevant  document  closer  to  1). 
The  Wilcoxon  test  statistic  ranged  from  highly  significant 
(p  value  <  0.0001)  to  moderately  significant  (p  value  < 
0.018),  but  in  all  4  tests  the  RDM  was  the  superior 
method. 

Recently,  we  compared  the  two  methods  in  terms  of  their 
computational  cost.  We  collected  316  actual  queries 
submitted  by  the  users  at  Bellcore  to  ADVISOR  and  found 
out  how  long  it  took  to  do  the  computations  needed  by 
each  method  for  these  queries.  (We  ignored  the  time  it 
takes  to  do  the  I/O  and  the  sort  of  the  documents  since  this 
is the same for both methods.) The current version of
ADVISOR represents documents and terms by
300-dimensional vectors and has 1023 documents in its
collection. The computations were done on a DEC
5000/200 machine. The computation time (the sum of
user and system time) is plotted against the number of
terms in the query in Figure 1. VA clearly took
substantially longer than RDM: the median VA time
was 0.53 seconds, while the median RDM time was 0.02
seconds.

Conclusions 

The Relevance Density Method of ranking documents for
retrieval was designed to overcome two problems of the
currently used method, Vector Averaging. These problems
are: (1) poor performance in the case of multimodal
queries and (2) high computational cost. The proposed
method was tested on Bellcore's ADVISOR system and
performed faster and better than Vector Averaging in these
tests.

TABLE 1

ADVISOR RESULTS

1987 QUERIES
Method    Lower Q    Median    Upper Q
VA        1          3         19
RDM       1          2         9
z_wilc = -2.14     p value = 0.016    # of queries = 263

1987 PAIRS
Method    Lower Q    Median    Upper Q
VA        1          5         24
RDM       1          3         19
z_wilc = -4.20     p value = 0.000    # of queries = 66

1989 QUERIES
Method    Lower Q    Median    Upper Q
VA        2          8         51
RDM       1          5         29
z_wilc = -0.920    p value = 0.018    # of queries = 43

1989 PAIRS
Method    Lower Q    Median    Upper Q
VA        3          10        31
RDM       1          6         [not legible]
z_wilc = -0.918    p value = 0.018    # of queries = 98

Figure 1. Computing time vs. # terms in query, ADVISOR
(1 = vector averaging, 2 = relevance density).

1. INTRODUCTION

Information that resides in two computer data bases can
be useful for analysis and policy decisions. For instance, an
epidemiologist might wish to evaluate the effect of a new
cancer treatment by matching information from a collection
of medical case studies against a death index that contains
information about the cause and date of death. An
economist might wish to evaluate energy policy decisions by
matching a data base containing fuel and commodity
information for a set of companies against a data base
containing the values and types of goods produced by the
companies. If unique identifiers such as verified social
security numbers or employer identification numbers are
available, then matching data sources is straightforward and
standard methods of statistical analysis are applicable.

If such identifiers are not available, then matching must
be performed using information such as company or
individual name, address, age, and other descriptive
information. Even when typographical variation and errors
are absent, name information such as 'Smith' and 'Robert'
may not uniquely identify an individual. Use of address
information is often subject to error because
parsing/standardization software does not effectively allow
comparison of, say, a house number with a house number
and a street name with a street name. The addresses of an
individual we wish to match may differ because one is
erroneous or because the individual has moved.

Fellegi and Sunter (1969) presented a formal
mathematical model and showed the optimality of decision
rules in a record linkage strategy. Pairs of records in a file
are given a score. Those above a certain score are
designated matches, those below a second, lower, score are
designated nonmatches, and those with scores between
the higher and lower scores are held for clerical review.
The scores, or computer matching weights, are based on a
crucial likelihood ratio that is often difficult to estimate (see
e.g., Winkler and Thibaudeau 1990; Belin and Rubin 1990,
1991).
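
The three-way decision just described can be sketched as follows. The cutoff values and the function name are hypothetical; in practice the cutoffs would be chosen from the estimated matching weights and the error rates one is willing to tolerate.

```python
def classify_pair(score, upper, lower):
    """Assign a pair to match, nonmatch, or clerical review by its score."""
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if score > upper:
        return "match"
    if score < lower:
        return "nonmatch"
    return "clerical review"


# Hypothetical cutoffs on the matching-weight scale.
UPPER, LOWER = 6.0, 1.0
labels = [classify_pair(s, UPPER, LOWER) for s in (9.5, 3.2, -4.0)]
```

The width of the band between the two cutoffs determines how many pairs are sent to clerical review, which is the cost the adjustment procedure aims to reduce.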

With files of moderate size, several thousand pairs may
need to be clerically reviewed. As such review often
involves examining paper forms (if they exist) or use of
additional data sources, it is expensive and subject to error.
With large files, reviewing hundreds of thousands of pairs
is likely to be prohibitively expensive.

Winkler and Scheuren (1991) introduced a model that
provides a means of adjusting general regression analyses
for matching error. The main purpose of the adjustment
procedure is to reduce or eliminate the need for clerical
review. At a minimum, the procedures tell us how much
accuracy is improved via adjustment, whether estimates are
sufficiently accurate for statistical analyses and policy
decisions, and how much cost must be incurred (through
targeted clerical review) to ensure a given benefit in
increased accuracy. The key to the adjustment procedure is
estimating accurately the proportions of matches and
nonmatches within a set of pairs for all ranges of scores.
The method of estimating proportions of matches within
weight ranges is due to Belin and Rubin (1990, 1991).

The paper presents an evaluation of the adjustment
procedure for ordinary linear regression. The evaluation
tool is an extension of Rubin's multiple imputation (see e.g.,
Rubin 1987, pp. 75-77). The empirical data base is
constructed from two files for which true matching status of
pairs is known. Very extensive review and verification of
pairs was done to assure that matching status is accurate.
Numerical data are constructed using known normal models.
Different sets of seed numbers produce different samples.
The intuitive idea of multiple imputation is that the
structure of data relationships and the model under which
we impute places restraints on the statistical estimates being
considered. For nonresponse (Rubin 1987), the set of data
values associated with respondents, the pattern of
nonresponse, and the imputation model all affect multiply
imputed parameter estimates and their variances. For this
paper, what records from one file are matched with what
records from another file, the data associated with the
matched records, and the model for adjusting for matching
error all affect the multiply imputed estimates.

The outline for the remainder of the presentation is as
follows. In the second section we present some of the
theoretical background. In the third section we present brief
results. The final section consists of discussion.

2. BACKGROUND

2.1 Theoretical Adjustment Model

This section provides a description of the regression
framework and adjustment methodology for the simplest
classes of univariate regression. The theory for general
regression is given by Winkler and Scheuren (1991).

Let $Y = \beta X + \varepsilon$ be the ordinary univariate regression
model for which error terms are independent with constant
variance $\sigma^2$. If we were working with a single data base,
Y would be regressed on X in the usual manner. For
$i = 1, \ldots, N$, we wish to use $(X_i, Y_i)$ but use $(X_i, Z_i)$.
$Z_i$ is usually $Y_i$ but may take some other value $Y_j$ due to
matching error. For $i = 1, \ldots, N$,

\[
Z_i =
\begin{cases}
Y_i & \text{with probability } p_i, \\
Y_j & \text{with probability } q_{ij} \text{ for } j \neq i,
\end{cases}
\qquad
p_i + \sum_{j \neq i} q_{ij} = 1.
\]

The probability $p_i$ may be zero or one. We define $h_i =
1 - p_i$ and divide the set of pairs into N mutually
exclusive classes. The classes are determined by records
from one of the files. Each class consists of the independent
x-variable $X_i$, the true value of the dependent
y-variable, the values of the y-variables from records in the
second file to which the record in the first file containing
$X_i$ has been paired, and computer matching weights.
Some of the N classes may have zero matching weights.
By paired we mean two records from the two files that have
been brought together during the record linkage process but
for which no determination of matching status may have
been made. Under an assumption of 1-1 matching, for each
$i = 1, \ldots, N$, there exists at most one j such that $q_{ij} > 0$.
We let $\phi$ be defined by $\phi(i) = j$.

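To define regression properly, we need to find $\mu_z = E(Z)$,
$\sigma_z^2$ and $\sigma_{zx}$. We observe that

\[
E(Z) = \frac{1}{N} \sum_i E(Z \mid i)
     = \frac{1}{N} \sum_i \Big( Y_i p_i + \sum_{j \neq i} Y_j q_{ij} \Big)
     = \frac{1}{N} \sum_i Y_i + \frac{1}{N} \sum_i \big[ Y_i (-h_i) + Y_{\phi(i)} h_i \big]
     = \bar{Y} + B.
\]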
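
The decomposition of E(Z) into the true mean plus a bias term can be checked numerically. The sketch below assumes 1-1 matching as above, so that each record i has at most one competing record phi(i); the data values, probabilities, and function names are invented for illustration.

```python
def expected_observed_mean(y, p, phi):
    """E(Z) when Z_i = Y_i with probability p_i and Y_phi(i) otherwise."""
    n = len(y)
    return sum(y[i] * p[i] + y[phi[i]] * (1 - p[i]) for i in range(n)) / n


def bias_term(y, p, phi):
    """B = (1/N) * sum_i h_i * (Y_phi(i) - Y_i), where h_i = 1 - p_i."""
    n = len(y)
    return sum((1 - p[i]) * (y[phi[i]] - y[i]) for i in range(n)) / n


y = [10.0, 20.0, 30.0, 40.0]      # true y-values
p = [1.0, 0.9, 0.5, 0.8]          # probability that the pair is a true match
phi = [0, 2, 1, 0]                # index of the competing record (unused when p = 1)

y_bar = sum(y) / len(y)
e_z = expected_observed_mean(y, p, phi)
b = bias_term(y, p, phi)
```

Here the bias is negative because the competing records tend to carry smaller y-values than the true ones.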


Similarly, we can represent $\sigma_{zx}$ in terms of $\sigma_{yx}$ and
a bias term $B_{yx}$, and $\sigma_z^2$ in terms of $\sigma_y^2$ and a bias
term $B_{yy}$. We neither assume that the bias terms have
expectation zero nor that they are uncorrelated with the
observed data.


Different equations yield the adjustments that relate
regression coefficients $\beta_{zx}$ based on observed data with
regression coefficients $\beta_{yx}$ based on true values. Our
assumption of 1-1 matching (which is not needed for the
general theory) is made for computational tractability, to
reduce the number of records and amount of information
that must be tracked during the matching process.

In implementing the adjustments, we make two crucial
assumptions. The first is that, for $i = 1, \ldots, N$, we can
accurately estimate the true probabilities of a match $p_i$.
The second is that, for each $i = 1, \ldots, N$, the true value $Y_i$
associated with independent variable $X_i$ is the pair with the
highest matching weight and the false value $Y_{\phi(i)}$ is
associated with the second highest matching weight.


2.2 Empirical Data Base


The empirical data base is created from two files of
10,000 records having known matching status. Basic
matching  parameters  (see  e.g.,  Winkler  and  Thibaudeau 
1990)  are  estimated  that  cause  the  curves  of  log  frequencies 
versus  matching  weight  for  nonmatches  and  matches  to 
separate  (Figure  1).  Matching  probabilities  are  estimated 
using  the  Belin-Rubin  methodology  (Table  1).  We  see  that 
the  estimated  probabilities  agree  quite  closely  in  the  tails 
(above  weight  4  and  below  weight  2).  For  weight  3,  the 
deviation  is  relatively  large  because  the  true  proportion  of 
false  matches  is  0.06  while  the  estimated  one  is  0.20. 


Table 1. Probabilities and Counts
of Matches and Nonmatches
in Weight Ranges

            --- Count ---     - Probability -
weight      Mat      NM       true     est
  11        6950     0        .00      .00
  10        785      0        .00      .00
   9        610      0        .00      .00
   8        439      3        .00      .00
   7        250      4        .00      .01
   6        265      9        .03      .03
   5        167      8        .05      .06
   4        89       6        .06      .11
   3        84       5        .06      .20
   2        38       7        .16      .31
   1        33       34       .51      .46
   0        13       19       .59      .61
  -1        7        20       .74      .74
  -2        3        11       .79      .84
  -3        4        19       .83      .89
  -4        0        15       .99      .94
  -5        0        15       .99      .96
  -6        0        27       .99      .98
  -7        0        107      .99      .99

In the first column, weight 10 means weight range from 10
to 11. Weight ranges 11 and above and -7 and below are
added together separately. Mat is match and NM is nonmatch.
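
The "true" probability column appears to be the share of nonmatches among the pairs in each weight range. That reading is an inference from the counts rather than something the text states, and the short check below tests it on a few rows of Table 1.

```python
def false_match_proportion(matches, nonmatches):
    """Share of pairs in a weight range that are nonmatches."""
    return nonmatches / (matches + nonmatches)


# (Mat, NM) counts copied from selected rows of Table 1, keyed by weight.
counts = {3: (84, 5), 2: (38, 7), 1: (33, 34), 0: (13, 19), -1: (7, 20)}

proportions = {
    weight: round(false_match_proportion(mat, nm), 2)
    for weight, (mat, nm) in counts.items()
}
```

For these rows the computed values agree with the printed "true" column to two decimals; for weight 3, for example, 5 / 89 rounds to .06.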


Each unique record in the merged data files has an
independent x-variable that is generated according to a
uniform distribution between 1 and 101 and a dependent
y-variable that is generated with random normal error
such that the slope is 2 and the R-square value
is approximately 0.45. Error arises because the observed
(x,y)-pair that is normally used in computation has a y-value
from a record to which the record containing the x-value
was falsely matched.
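
A data set of this kind can be simulated as sketched below. Several details are assumptions made for the sketch, not taken from the paper: a zero intercept, an error standard deviation derived from the target R-square, and false matches created by giving a fixed fraction of records the y-value of another record. The seeds and the 30 percent mismatch rate are arbitrary.

```python
import math
import random


def noise_sd_for_r2(slope, x_low, x_high, r2):
    """Error s.d. giving the requested population R-square for y = slope*x + e."""
    var_x = (x_high - x_low) ** 2 / 12.0       # variance of Uniform(x_low, x_high)
    explained = slope ** 2 * var_x
    return math.sqrt(explained * (1.0 - r2) / r2)


def simulate_records(n, seed, slope=2.0, r2=0.45):
    """Generate n (x, y) records with x uniform on (1, 101)."""
    rng = random.Random(seed)
    sd = noise_sd_for_r2(slope, 1.0, 101.0, r2)
    xs = [rng.uniform(1.0, 101.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, sd) for x in xs]
    return xs, ys


def false_match(ys, fraction, seed):
    """Give a fraction of records a y-value taken from another sampled record."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(ys)), int(fraction * len(ys)))
    donors = chosen[1:] + chosen[:1]           # rotate so no record keeps its own y
    observed = list(ys)
    for i, j in zip(chosen, donors):
        observed[i] = ys[j]
    return observed


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


noise_sd = noise_sd_for_r2(2.0, 1.0, 101.0, 0.45)
xs, ys = simulate_records(5000, seed=1991)
zs = false_match(ys, fraction=0.3, seed=7)
true_slope = ols_slope(xs, ys)
observed_slope = ols_slope(xs, zs)
```

Because a falsely matched y-value is essentially unrelated to x, the slope fitted to the observed pairs is pulled toward zero, which is the bias the adjustment is meant to correct.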

For the analysis we consider only those pairs having
matching weights between 0 and 10 because all pairs above
weight 10 are true matches. Pairs between 0 and 10 contain
both true and false matches. We do this to determine how
much the adjustment improves the accuracy of the regression
analyses in situations for which there is significant
matching error. If we include pairs above weight 10, then
it is more difficult to judge the adjustment process
because ordinary regression estimates based on observed
data and adjusted regression estimates will both be relatively
more accurate.

In  the  remainder  of  the  paper,  whenever  we  use  true,  we 
will  mean  estimates  based  on  the  true  values.  Similarly, 
when  we  use  observed,  we  mean  estimates  based  on 
observed  data.  Adjusted  will  always  refer  to  estimates 
obtained  via  the  adjustment  methods  of  this  paper. 

3.  RESULTS 

The results of using the adjustment process are illustrated
in Figure 2. Figure 2a provides a comparison of the relative
coefficients of variation of the adjusted procedure versus the
nonadjusted procedure. To get the plotted points, the
coefficients of variation (cvs) computed via either
procedure are divided by the true cv for weight class 8.
The results show that both adjusted and nonadjusted
procedures yield approximately the same cv estimates and
that cvs decrease as sample size increases. The relative bias
of the cvs for the adjusted procedure is substantially lower
than the relative bias for the nonadjusted procedure (Figure
2b). The nonadjusted procedure uses ordinary linear
regression on the observed data pairs.

Multiply imputed estimates for 25 samples (Table 2)
show the relative cv estimates for both adjusted and
nonadjusted procedures are about the same while the higher
bias of the nonadjusted procedure yields higher quasi root
mean square errors (qrmse). The term qrmse is used
because we use an estimate of the variance component of
root mean square error rather than the true value. We
observe that for higher weight ranges, say between 6 and
10, both the adjusted procedure and nonadjusted procedure
produce about the same qrmses, 0.056 and 0.058, respectively.
As weight ranges having more erroneous data are included, say
between 0 and 10, qrmse under the adjusted procedure,
0.048, is substantially lower than under the nonadjusted
procedure, 0.081.
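
The paper does not give a formula for qrmse. One reading that roughly reproduces the Table 2 entries is the square root of the squared cv plus the squared relative bias of the coefficient estimate. The sketch below applies that assumed definition to the weight class 0 values from Table 2; it is an illustration of that reading, not the author's stated computation.

```python
import math


def relative_bias(estimate, true_value):
    """Bias of an estimate expressed as a fraction of the true value."""
    return (estimate - true_value) / true_value


def qrmse(estimate, true_value, cv):
    """Assumed definition: sqrt(cv^2 + relative_bias^2)."""
    return math.sqrt(cv ** 2 + relative_bias(estimate, true_value) ** 2)


# Weight class 0 in Table 2: true 2.007, adjusted 1.976, observed 1.870.
q_adjusted = qrmse(1.976, 2.007, 0.046)
q_observed = qrmse(1.870, 2.007, 0.046)
```

Under this reading the two procedures have nearly the same variance component, and the gap in qrmse comes almost entirely from the larger bias of the nonadjusted estimate.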

4.  DISCUSSION 

The  multiple  imputation  procedure  adopted  for  analyzing 
the  adjustment  procedure  was  intended  to  dampen  the 
influence  of  the  regression-variable-creation  procedure. 
Specifically,  as  individual  samples  showed  significant 
variation  from  sample  to  sample,  it  was  difficult  to 
determine  how  much  of  an  improvement  the  adjustment 
procedure yielded. Although not shown, the between-sample
component of the variance estimated via the multiple
imputation procedure was roughly equal to the within-sample
component. If we had considered only individual samples,
we would have missed the additional source of variation.
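
The within-sample and between-sample components mentioned here can be illustrated with Rubin's standard combining rules for multiple imputation. The paper uses an extension of multiple imputation whose details are not given, so the sketch below shows only the standard rules, with invented estimates and variances.

```python
def combine_estimates(estimates, variances):
    """Rubin's rules: pooled estimate and within, between, and total variance."""
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total


# Invented slope estimates and their estimated variances from three samples.
slopes = [2.0, 2.1, 1.9]
slope_variances = [0.01, 0.01, 0.01]
pooled, within, between, total = combine_estimates(slopes, slope_variances)
```

In this toy example the between-sample component equals the within-sample component, so an analysis of a single sample would understate the total variance by more than half.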


Table 2. Comparison of Estimates
Averaged over 25 Samples

wgt                      Coefficient Estimates
class    size            true      est       obs
  8      442    coef     2.020     2.018     2.004
                cv       0.082     0.082     0.082
                qrmse              0.082     0.082
  6      970    coef     2.015     2.002     1.976
                cv       0.053     0.056     0.056
                qrmse              0.056     0.058
  4      1240   coef     2.010     2.006     1.956
                cv       0.046     0.048     0.049
                qrmse              0.048     0.055
  2      1374   coef     2.005     2.025     1.940
                cv       0.044     0.047     0.047
                qrmse              0.049     0.056
  0      1473   coef     2.007     1.976     1.870
                cv       0.042     0.046     0.046
                qrmse              0.048     0.081

Note: Weight class 2 means those pairs having weight above 2
and below 9.

This paper reflects views of the author and not necessarily
those of the Census Bureau.

Figure 1. Log of Frequency vs Weight, Matches & Nonmatches

Figure 2a. Relative Coefficient of Variation vs Weight, Estimated Probabilities
Figure 2b. Relative Bias vs Weight, Estimated Probabilities

Abstract

In record linkage, the correct statistical model
underlying a particular application may present estimation
difficulties. Often, a convenient model is substituted in place
of the correct one. Naturally, the substitution induces an
error and one can only hope that the error is negligible. This
paper compares two models as they are applied to the data
collected during the 1988 Dress Rehearsal of the Decennial
Census.

Key Words: Record linkage; Decision rule; Conditional
independence model

1  Introduction 

The paper compares three related techniques of record
linkage. Sections 2 to 5 give a background on record
linkage, while sections 6 to 8 apply the 3 techniques to a
particular situation.

2  Record-Linkage  Rules 

Consider two populations of individuals: population A
and population B. Denote the individuals of these two
populations by a and b respectively. A and B
may have some individuals in common. Consider the set of
all possible ordered pairs (a,b). This set is the cartesian
product $A \times B = \{(a,b) \mid a \in A, b \in B\}$ and it can be
divided into two sets: $M = \{(a,b) \mid a \in A, b \in B, a = b\}$ and
$U = \{(a,b) \mid a \in A, b \in B, a \neq b\}$. The pairs in M are the
links, while the pairs in U are the non-links. Note that
$M \cap U = \emptyset$ and $M \cup U = A \times B$. Let $\alpha$ be a
record generating function on A and let $\beta$ be a record
generating function on B. These two functions produce
$\alpha(a)$ and $\beta(b)$, the records of a and b
respectively. $\gamma$ is a comparison function over
$\alpha(A) \times \beta(B)$ if for any individual $a \in A$ and for any
individual $b \in B$, the record $\alpha(a)$ can be compared to
the record $\beta(b)$ through the comparison function
$\gamma(\alpha(a), \beta(b))$. Finally, the comparison space is the set
$\Gamma = \{\gamma(\alpha(a), \beta(b)) \mid a \in A, b \in B\}$, the set of all possible
comparison values.


In practice, the comparison function is a vector valued
function. Each vector component $\gamma^i(\alpha(a), \beta(b))$, where
$i = 1, \ldots, N$, corresponds to a specified field, such as last
name or age. $\gamma^i(\alpha(a), \beta(b))$ is assigned the value 0 if the
records of the two individuals disagree over field i and it
is assigned 1 if they agree. The comparison space $\Gamma$ is
the set of all binary vectors (i.e. whose components are 0 or
1) of dimension N.
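
A binary comparison vector of this kind can be built as in the sketch below. It assumes simple exact agreement on each field; the field names and the two records are invented for illustration.

```python
def comparison_vector(record_a, record_b, fields):
    """Return a tuple with 1 where the records agree on a field and 0 otherwise."""
    return tuple(1 if record_a[f] == record_b[f] else 0 for f in fields)


FIELDS = ("surname", "first_name", "age")

census_record = {"surname": "SMITH", "first_name": "ROBERT", "age": 41}
survey_record = {"surname": "SMITH", "first_name": "ROB", "age": 41}

gamma = comparison_vector(census_record, survey_record, FIELDS)
```

With N fields there are 2**N possible vectors, which is the size of the comparison space described above.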

Consider a particular comparison vector denoted by $\gamma^*$.
The probability that a pair of records (a,b) gives rise to
$\gamma^*$, through the comparison function $\gamma$ and given that
the pair belongs to the set of links M, is defined as
follows:

\[
m(\gamma^*) = \sum_{(a,b) \in M} \Pr[\gamma(\alpha(a), \beta(b)) = \gamma^* \mid (a,b)] \, \Pr[(a,b) \mid M].
\]

Similarly,

\[
u(\gamma^*) = \sum_{(a,b) \in U} \Pr[\gamma(\alpha(a), \beta(b)) = \gamma^* \mid (a,b)] \, \Pr[(a,b) \mid U]
\]

is the probability that a pair of records gives rise to $\gamma^*$
given that the pair is a non-link.

The purpose of record-linkage is to determine which pairs
are the links. In this respect, a decision rule is constructed.
Let $A_1$ be the decision to declare a given pair a link,
while $A_2$ is the decision to declare that same pair a
possible link, and $A_3$ is the decision to declare the pair a
non-link. Any one of these three decisions is taken on the
basis of $\gamma$. It is assumed that the comparison vector $\gamma$
is sufficient. In this context a decision function is a triplet of
probabilities, $d(\gamma) = (\Pr[A_1 \mid \gamma], \Pr[A_2 \mid \gamma], \Pr[A_3 \mid \gamma])$, where
$\Pr[A_i \mid \gamma]$ is the probability of making decision $A_i$ when
observing comparison vector $\gamma$, where $i = 1, \ldots, 3$.
Naturally $\Pr[A_i \mid \gamma] \ge 0$ and $\sum_{i=1}^{3} \Pr[A_i \mid \gamma] = 1$. The
definition of record linkage rule follows easily: A record
linkage rule (linkage rule) is a mapping from the comparison
space $\Gamma$ onto a set of decision functions $D = \{d(\gamma)\}$.

Two types of error may occur when applying a linkage rule.
The type I error occurs whenever a pair declared a non-link
is in fact a link. The type II error occurs when a pair is
declared a link but is not. A linkage rule is said to be a
linkage rule at the levels $\mu$ and $\lambda$, where $0 < \mu < 1$
and $0 < \lambda < 1$, if $\Pr[A_1 \mid U] = \mu$ and $\Pr[A_3 \mid M] = \lambda$.
Here $\Pr[A_1 \mid U]$ is the type II error and $\Pr[A_3 \mid M]$ is the
type I error. Such a linkage rule is denoted by $L(\mu, \lambda, \Gamma)$.
Furthermore, the rule $L(\mu, \lambda, \Gamma)$ is said to be optimal at the
levels $\mu$ and $\lambda$ if for any other linkage rule at the
levels $\mu$ and $\lambda$, denoted by $L^*(\mu, \lambda, \Gamma)$, the
following holds: $\Pr[A_2 \mid L] \le \Pr[A_2 \mid L^*]$. That is, the
probability of declaring any pair a possible link is no greater
under rule $L$ than under rule $L^*$, while maintaining
the same error levels.

3  The  Fellegi-Sunter  Theorem 
Fellegi and Sunter (1969) formally show how to construct an
optimal linkage rule. Let all the comparison vectors $\gamma$ be
ordered by decreasing order of the ratio $m(\gamma)/u(\gamma)$. If
there are ties, order is assigned randomly among them. This
ordering gives rise to a sequence $\{\gamma_i\}_{i=1}^{N_\Gamma}$, where $N_\Gamma$ is
the total number of comparison vectors. For given error
levels $\mu$ and $\lambda$ assume there exist $n$ and $n'$
such that

\[
\sum_{i=1}^{n-1} u(\gamma_i) < \mu \le \sum_{i=1}^{n} u(\gamma_i),
\qquad
\sum_{i=n'}^{N_\Gamma} m(\gamma_i) \ge \lambda > \sum_{i=n'+1}^{N_\Gamma} m(\gamma_i).
\]

Consider the following linkage rule:

\[
d(\gamma_i) =
\begin{cases}
(1, 0, 0) & i \le n-1 \\
(P_\mu, 1-P_\mu, 0) & i = n \\
(0, 1, 0) & n < i \le n'-1 \\
(0, 1-P_\lambda, P_\lambda) & i = n' \\
(0, 0, 1) & i \ge n'+1
\end{cases}
\tag{1}
\]

$P_\mu$ must satisfy $\mu = \sum_{i=1}^{n-1} u(\gamma_i) + P_\mu \, u(\gamma_n)$. This ensures
the consistency of the randomisation rule. A similar
constraint involves $P_\lambda$.

THEOREM 1 (Fellegi and Sunter, 1969): The linkage rule
defined in (1) is optimal at the levels $\mu$, $\lambda$.
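
The construction in Theorem 1 can be sketched directly when the m- and u-probabilities are known. The sketch below uses 0-based positions internally, whereas the theorem indexes from 1; the two-field probabilities are invented, and the function names are hypothetical.

```python
def fellegi_sunter_rule(m, u, mu, lam):
    """Return {gamma: (P[A1], P[A2], P[A3])} for error levels mu and lam."""
    def ratio(g):
        return m[g] / u[g] if u[g] > 0 else float("inf")

    order = sorted(m, key=ratio, reverse=True)

    # n: first position where the cumulative u-probability reaches mu.
    cumulative, n, p_mu = 0.0, None, None
    for i, g in enumerate(order):
        if cumulative + u[g] >= mu:
            n, p_mu = i, (mu - cumulative) / u[g]
            break
        cumulative += u[g]

    # n_prime: last position where the tail m-probability reaches lam.
    tail, n_prime, p_lam = 0.0, None, None
    for i in range(len(order) - 1, -1, -1):
        g = order[i]
        if tail + m[g] >= lam:
            n_prime, p_lam = i, (lam - tail) / m[g]
            break
        tail += m[g]

    if n is None or n_prime is None or n >= n_prime:
        raise ValueError("no n < n' exists for these error levels")

    rule = {}
    for i, g in enumerate(order):
        if i < n:
            rule[g] = (1.0, 0.0, 0.0)              # declare a link
        elif i == n:
            rule[g] = (p_mu, 1.0 - p_mu, 0.0)      # randomise: link / possible link
        elif i < n_prime:
            rule[g] = (0.0, 1.0, 0.0)              # possible link
        elif i == n_prime:
            rule[g] = (0.0, 1.0 - p_lam, p_lam)    # randomise: possible / non-link
        else:
            rule[g] = (0.0, 0.0, 1.0)              # declare a non-link
    return rule


def error_levels(rule, m, u):
    """Return (Pr[A1 | U], Pr[A3 | M]) achieved by a rule."""
    type_two = sum(u[g] * rule[g][0] for g in rule)
    type_one = sum(m[g] * rule[g][2] for g in rule)
    return type_two, type_one


# Invented probabilities for two comparison fields.
m_prob = {(1, 1): 0.70, (1, 0): 0.15, (0, 1): 0.10, (0, 0): 0.05}
u_prob = {(1, 1): 0.02, (1, 0): 0.08, (0, 1): 0.20, (0, 0): 0.70}

rule = fellegi_sunter_rule(m_prob, u_prob, mu=0.05, lam=0.10)
type_two, type_one = error_levels(rule, m_prob, u_prob)
```

The randomised decisions at positions n and n' are what allow the rule to hit the two error levels exactly rather than only approximately.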

In order to make use of Theorem 1, the ratio $m(\gamma)/u(\gamma)$
must be known for each observable value of the comparison
vector $\gamma$. Of course, in practice, those ratios are unknown
and must be estimated. To perform the estimation, a class of
probabilistic models is established. Then estimation
techniques are used. Before introducing some classes of
models, more notation must be reviewed.

4 Notation for record-linkage
Let $v_{k, i_1, \ldots, i_N}$ be the count of pairs with the
following attributes: whenever k = 0 the corresponding
pairs are non-links and whenever k = 1 they are links.
Furthermore, when $i_s = 0$, the corresponding pairs do not
exhibit record agreement over comparison field s and
whenever $i_s = 1$, the pairs do exhibit record agreement
over comparison field s. Note that $s = 1, \ldots, N$, where N
is the number of comparison fields.

It is important to realize that the counts $v_{k, i_1, \ldots, i_N}$ cannot be
observed. Rather, what is observed are the aggregated
counts, denoted by $v_{\cdot, i_1, \ldots, i_N}$, where
$v_{\cdot, i_1, \ldots, i_N} = v_{0, i_1, \ldots, i_N} + v_{1, i_1, \ldots, i_N}$.

This notation is useful. The next section presents a class of
record linkage models.

5  The  Conditional  Independence  Model 
Goodman (1974) gives a thorough analysis of the conditional
independence model. It is best described by its log-linear
representation:

(2)  [equation not legible in this copy]

There are constraints on the parameters involved on the right
hand side of (2). These can easily be deduced and are left
out here.

The expression on the right-hand side of (2) includes one
term corresponding to the effect of the link status (link/non-link)
of the counted pairs and one term for the effect
of each comparison field. It also includes terms for
the interaction effects between the link status and the
fields. However, there are no interaction terms
between the fields and this implies that, conditional on the
latent class, the fields are independent. In many cases,
because of dependency relationships, it is necessary to
include interaction terms between certain fields. In those
cases, selective models must be used. The following situation
is such a case.

6 Applications: The St Louis data
These data were collected in 1988 during a dress rehearsal,
in preparation for the Decennial Census operations. Two
files were created, based on two surveys of the individuals
living in a defined geographical area within the city of St.
Louis. Those surveys are the census and the post-enumeration
survey. In both cases, for each individual
reported at the time of the survey, a record is created and
various characteristics of the individual are recorded. The
objective is to link the records of the Census file with the
records of the Post Enumeration Survey. The comparison
fields are indexed 1 to 11 and are in order: Surname, house
no., street name, phone number, first name, middle initial,
marital status, age, race, sex and relationship with the
respondent.

7 Two Models for the St Louis data
In the case of the St. Louis data there are dependencies
between some fields, particularly among the non-links,
between the household fields. These are surname, street no.,
street name and telephone. When two individuals agree on
some of the household fields, they are more likely to be living
under the same roof and therefore they are more likely to
agree on the rest of the household fields.


In this section, two explanatory models are proposed for the
St. Louis data. The first model is the conditional
independence model in (2) with N = 11. The second
model includes interaction terms between the household
variables. The log-linear representation of this model is
similar to that in (2), but with the addition of 2nd, 3rd and
4th order interaction terms between the household fields,
among the non-links.
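
For orientation only, a conditional independence model with a latent link status is commonly written in log-linear form as below. This is a generic textbook form given as an assumption; the symbols $\theta$ and $\lambda$ and the superscripts are chosen here to match the verbal description of (2) and are not the author's notation.

```latex
\[
\log \mathrm{E}\left[ v_{k, i_1, \ldots, i_N} \right]
  = \theta
  + \lambda^{L}_{k}
  + \sum_{s=1}^{N} \lambda^{s}_{i_s}
  + \sum_{s=1}^{N} \lambda^{L s}_{k\, i_s},
\qquad k \in \{0, 1\},\; i_s \in \{0, 1\}.
\]
```

Here $\lambda^{L}_{k}$ is the link-status effect, $\lambda^{s}_{i_s}$ the effect of field s, and $\lambda^{L s}_{k\, i_s}$ the interaction between link status and field s. The absence of any term involving two fields at once is what makes the fields independent within each link status. A model of the second kind would add such field-by-field terms for the household fields when k = 0.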

8  Linkage  Performances 

In this section, the models are fitted to the St. Louis data. The
Fellegi-Sunter rule is applied under the conditional
independence model and under the model with interactions
between the household fields.

In section 2, the Type II error is defined as the proportion of
non-links actually declared links. For the St. Louis data it is
known which pairs are the links a priori. This information
was obtained through tedious follow-up operations. With this
information, the Type II error can be controlled when
applying the Fellegi-Sunter decision rule.

There are exactly 9823 links among the pairs. Table 1
contains the number of links that were actually recovered,
applying the Fellegi-Sunter rule, under the 2 models
presented previously and under an ad-hoc model, for 3
different controlled Type II errors. The ad-hoc model is
based on informal advice from W. E. Winkler (1989). The
principle behind it is to improve the performance of the
conditional independence model by adjusting its parameters,
rather than using a more elaborate model. The adjustments
are largely based on experience and past knowledge of
similar processes. One such adjustment, for example, is the
increase of the value of the term corresponding to the effect
of the first name in (2) to ensure that pairs of records
agreeing on the first name are weighted heavily. The
advantage of this method is that it does not require
estimation procedures beyond those used for the conditional
independence model.


418  Y.  Thibaudeau 


Clearly, the model including interactions is the best when the
tolerated error is at its smallest (.01). At that error level the
conditional independence model is poor, but the ad-hoc
model does fairly well. If the tolerated error goes up to .02,
then both the independence model and the ad-hoc model
catch up with the model with interactions. This trend continues
as the error is allowed to climb to .03. At that point the
independence model is only 36 links behind the model with
interactions, whereas the ad-hoc model and the model with
interactions are virtually the same.
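
The comparisons above can be checked directly from the counts in
Table 1. The sketch below is illustrative only: it assumes that the
controlled error is the share of declared pairs that are non-links,
(Pairs - Links) / Pairs, a reading that is not spelled out in this
form in the text but that reproduces the three nominal levels to
within about 0.002. The dictionary layout and function names are
hypothetical.

```python
# Counts transcribed from Table 1: (links recovered, pairs declared links).
TRUE_LINKS = 9823

TABLE_1 = {
    0.01: {"independence": (7273, 7346),
           "interactions": (9712, 9808),
           "ad-hoc": (9562, 9659)},
    0.02: {"independence": (9636, 9824),
           "interactions": (9758, 9952),
           "ad-hoc": (9765, 9960)},
    0.03: {"independence": (9740, 10038),
           "interactions": (9776, 10062),
           "ad-hoc": (9783, 10097)},
}


def observed_error(links, pairs):
    """Share of declared pairs that are non-links (assumed error definition)."""
    return (pairs - links) / pairs


def recovery_rate(links, true_links=TRUE_LINKS):
    """Share of the true links that were recovered."""
    return links / true_links


for level, models in TABLE_1.items():
    for model, (links, pairs) in models.items():
        print(f"{level:.2f}  {model:13s}"
              f"  error={observed_error(links, pairs):.4f}"
              f"  recovered={recovery_rate(links):.4f}")
```

Under this reading the independence model recovers about 74% of the
true links at the .01 level, against about 99% for the model with
interactions.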

9  Conclusion 

The model with interactions clearly gives the best
performance when the tolerated Type II error is small. When
the tolerance on the Type II error is relaxed, the other
methods, especially the ad-hoc procedure, may be just as
good.

Table 1: Links Recovered For Three Error Levels

Error            Independence   Interactions   Ad-hoc

.01    Links         7273           9712         9562
       Pairs         7346           9808         9659

.02    Links         9636           9758         9765
       Pairs         9824           9952         9960

.03    Links         9740           9776         9783
       Pairs        10038          10062        10097


Abstract


This paper derives an expression for the optimum sampling
allocation under the minimum variance criterion for the
estimated attributable risk in case-control studies. Various
optimal strategies are examined using alternative
exposure-specific disease rates.

KEY WORDS: Odds Ratio, Relative Risk and Attributable Risk.


1 Introduction

Mullooly (1987) derived expressions for the optimal
number of cases and controls that minimize the total
sample size and ensure the required level of precision
for exposure-specific disease rates. Unequal sample
allocation rules for various types of clinical studies were
discussed by Gail et al. (1976). They suggested the
so-called ‘square root rule’ for the case in which the response
variable has a different variance in each group, and
presented various techniques to determine the optimal
number of subjects. Brittain and Schlesselman (1982)
examined the problem of optimal allocation for comparing
proportions, $p_1$ and $p_2$, in two groups in clinical
trials or follow-up studies. The criterion chosen was the
precision of the estimator. In a series of papers, Walter
(1975, 1976, and 1978) discussed procedures for estimating
attributable risk and its role in epidemiological research.
Walter and Morgenstern (1985) stressed the importance of
an optimal sampling plan, which depends essentially on the
choice of measure for summarizing the data. The purpose of
this paper is to derive an expression for the optimal
allocation rule under the minimum variance criterion for the
estimated attributable risk in case-control studies.


2 Optimal Allocation

The odds ratio as an approximation to the relative risk
of disease in a group of people exposed to a certain
risk factor, compared to those not exposed, has been
widely used since its introduction by Cornfield (1951).
Epidemiologists and public health officials suggested
the so-called ‘Attributable Risk.’ The measure of
‘Attributable Risk’ suggests the potential impact on disease
frequency of eliminating the exposure in the population.

Consider the following 2x2 contingency table (Table 1) for a
possible association between a dichotomous study factor
(A = exposed or unexposed) and a dichotomous disease
outcome (D).

Table 1. Data Layout

                        D
A               Cases    Controls

Exposed          a1         b1
Unexposed        a0         b0

Total            n1         n2         n

Denman and Schlesselman (1983) estimated the attributable
risk, which is given by

    $\hat{\lambda} = \frac{a_1 b_0 - a_0 b_1}{n_1 b_0}$    (1)

which can be expressed as follows:

    $\hat{\lambda} = \frac{p_1 - p_2}{1 - p_2}$    (2)

where $p_1 = a_1/n_1$, $p_2 = b_1/n_2$ such that
$p_1 + q_1 = 1$ and $p_2 + q_2 = 1$.
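
The equivalence of the count form and the proportion form of the
estimator can be checked numerically. The sketch below assumes the
estimator has the form $(a_1 b_0 - a_0 b_1)/(n_1 b_0)$ with
$p_1 = a_1/n_1$ and $p_2 = b_1/n_2$; the counts used are made up for
illustration and do not come from any study.

```python
def attributable_risk_from_counts(a1, a0, b1, b0):
    """Attributable risk from the 2x2 counts of Table 1 (cases a, controls b)."""
    n1 = a1 + a0  # total number of cases
    return (a1 * b0 - a0 * b1) / (n1 * b0)


def attributable_risk_from_proportions(p1, p2):
    """Same quantity from the exposure proportions among cases and controls."""
    return (p1 - p2) / (1.0 - p2)


# Hypothetical counts: 60 of 100 cases exposed, 30 of 100 controls exposed.
a1, a0, b1, b0 = 60, 40, 30, 70
by_counts = attributable_risk_from_counts(a1, a0, b1, b0)
by_props = attributable_risk_from_proportions(a1 / (a1 + a0), b1 / (b1 + b0))
print(by_counts, by_props)
```

Both calls give 3/7 for these counts, since
(60*70 - 40*30) / (100*70) = 3000/7000.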


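It is assumed that $p_1$ and $p_2$ are independently
binomially distributed. Walter (1976) has shown that the
variance of $\hat{\lambda}$ may be estimated by

    $\widehat{\mathrm{Var}}(\hat{\lambda}) = \left(\frac{a_0 n_2}{b_0 n_1}\right)^2 \left[\frac{a_1}{a_0 n_1} + \frac{b_1}{b_0 n_2}\right]$    (3)

Equation (3) can be rewritten as

    $\mathrm{Var}(\hat{\lambda}) = \left(\frac{1 - p_1}{1 - p_2}\right)^2 \left[\frac{p_1}{nF(1 - p_1)} + \frac{p_2}{n(1 - F)(1 - p_2)}\right]$    (4)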

where F denotes the proportion of the total sample subjects
which are assigned to group 1 (cases), i.e., $F = n_1/n$.
Minimization of $\mathrm{Var}(\hat{\lambda})$ requires
differentiating equation (4) partially with respect to F and
then equating to zero. Solving, one gets

    $F = \left[1 + \sqrt{\frac{p_2 (1 - p_1)}{p_1 (1 - p_2)}}\right]^{-1}$    (5)
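
The minimization can be illustrated numerically. The sketch below
assumes that the part of the variance depending on F is
$p_1/(F(1-p_1)) + p_2/((1-F)(1-p_2))$, with every other factor held
fixed, and compares the stationary point of that expression with a
brute-force grid search. The function names and the example values of
$p_1$ and $p_2$ are illustrative assumptions.

```python
import math


def variance_factor(F, p1, p2):
    """F-dependent part of the variance, up to a positive constant (assumed form)."""
    return p1 / (F * (1.0 - p1)) + p2 / ((1.0 - F) * (1.0 - p2))


def optimal_case_fraction(p1, p2):
    """Stationary point in F of variance_factor, obtained by setting dV/dF = 0."""
    ratio = (p2 * (1.0 - p1)) / (p1 * (1.0 - p2))
    return 1.0 / (1.0 + math.sqrt(ratio))


p1, p2 = 0.6, 0.3
closed_form = optimal_case_fraction(p1, p2)

# Brute-force check on a grid of allocations strictly inside (0, 1).
grid = [k / 1000.0 for k in range(1, 1000)]
grid_best = min(grid, key=lambda F: variance_factor(F, p1, p2))
print(round(closed_form, 4), grid_best)
```

When $p_1 = p_2$ the sketch returns F = 0.5, i.e. equal numbers of
cases and controls; as exposure becomes more common among cases than
among controls, the fraction allocated to cases rises above one half.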


The optimal sample size allocation to cases and controls
can be obtained by using Equation (5) for various
combinations of $p_1$ and $p_2$.


3 Concluding Remarks

It is often required to choose a particular combination of
$n_1$ and $n_2$ that maximizes the precision of the
estimator. Walter and Morgenstern (1985) emphasized that
the optimal sampling strategy depends on the choice of
function of $p_1$ and $p_2$. For example, Mullooly (1987)
discussed the optimum sampling strategies based on precise
estimation of disease rate in the exposed population.
However, expression (5) minimizes the variance
of attributable risk given by equation (4) for a given
fixed total number of subjects. The choice of estimator
for summarizing the data dictates the appropriate
allocation rule.


Leung, H.M. and Kupper, L.L. (1981). Comparisons
of confidence intervals for attributable risk.
Biometrics 37, 293-302.

Mullooly, J.P. (1987). Sample sizes for estimation
of exposure-specific disease rates in population-
based case-control studies. American Journal of
Epidemiology 125, 1079-1084.

Pentico, D.W. (1981). On the determination and use
of optimal sizes for estimating the difference in
means. The American Statistician 35, 40-42.

Walter, S.D. (1975). The distribution of Levin's
measure of attributable risk. Biometrika 62, 371-375.

Walter, S.D. (1976). The estimation and interpretation
of attributable risk in health research.
Biometrics 32, 829-849.

Walter, S.D. (1978). Calculation of attributable risks
from epidemiological data. International Journal
of Epidemiology 7, 175-182.

Walter, S.D. and Morgenstern, H. (1985). A note on
optimal sampling for the comparison of proportions
or rates. Statistics in Medicine 4, 541-542.


Abstract

The problem of allocating patients in a two treatment
clinical trial with dichotomous response is considered.
The trial goal is to determine the better treatment while
incurring as few patient losses as possible. Several
allocation rules are compared and it is found that bandit
strategies perform well on both criteria in that they achieve
nearly optimal power while keeping expected trial
failures nearly minimal. The rules are also evaluated
according to their computational complexity.

1  Introduction 

Researchers  designing  clinical  trials  often  encounter  dif¬ 
ficulties  when  trying  to  determine  the  best  way  to  allo¬ 
cate  patients  to  treatments  so  that  trial  goals  may  be 
achieved  and  the  costs  to  all  concerned  kept  at  a  mini¬ 
mum.  Conventional  designs,  in  which  subjects  are  allo¬ 
cated  to  groups  in  equal  or  predetermined  proportions, 
have  good  decision  making  properties  but  lack  the  flexi¬ 
bility  to  incorporate  other  desirable  design  goals.  Adap¬ 
tive  designs,  in  which  allocation  strategies  may  depend 
on  data  observed  during  the  trial,  have  more  flexibility. 
The  consideration  of  adaptive  techniques  raises  the  ques¬ 
tion  of  what  an  optimal  allocation  rule  is  for  a  problem 
where  statistical  merit  is  not  the  only  measure  of  the 
quality  of  a  design.  This  question  is  complex  and  in¬ 
triguing,  and  it  deserves  more  attention  than  it  is  given 
here,  where  only  a  simple  trial  set-up  is  examined.  What 
we  can  show,  however,  is  that  adaptive  designs  based  on 
optimal  strategies  for  bandit  problems  perform  well  ac¬ 
cording  to  multiple  criteria,  which  include  but  are  not 
restricted  to  the  ability  to  make  a  good  terminal  deci¬ 
sion.  In  particular,  these  rules  are  evaluated  according 
to  ethical  and  computational  criteria  and  then  compared 
with  standard  fixed  allocation  techniques. 

Now, consider a clinical trial in which we wish to compare
two treatments and determine, if possible, which
has the higher efficacy rate. The patients, who enter the
trial sequentially, are to be allocated to one of the two
therapies in such a way that trial goals are met as well
as possible. While any complete description of a clinical
trial design should address all aspects of trial protocol
(e.g., eligibility criteria, interpretation of responses, data
analysis, etc.), we focus on the effects of changing
allocation rules within otherwise fully specified designs.

’ Research supported in part by National Science Foundation
under grant DMS-8914328.

* Research supported in part by National Science
Foundation/DARPA under grant CCR-9004727.

It is assumed that the sample size for the trial is a fixed
number, n, but that the sample sizes for the treatment
groups, $n_1$ for $T_1$ and $n_2$ for $T_2$, may be random. The
response variables, X and Y from $T_1$ and $T_2$ respectively,
are independent Bernoulli random variables such that

(1)    $X_1, X_2, \ldots \sim B(1, P_1); \quad Y_1, Y_2, \ldots \sim B(1, P_2)$

where $(P_1, P_2) \in \Omega$ for $\Omega = (0, 1) \times (0, 1)$.

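An allocation rule, $\gamma$, is defined to be a sequence
$(\gamma_1, \ldots, \gamma_n)$ such that,

    $\gamma_i = \begin{cases} 1, & \text{if } T_1 \text{ is used for patient } i; \\ 0, & \text{if } T_2 \text{ is used for patient } i, \end{cases} \quad i = 1, \ldots, n.$

It is required that the decision, $\gamma_i$, at stage i, depend only
on the information available at that time.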

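The parameter of interest is the mean difference in
responses, $\Delta = P_1 - P_2$, and $T_1$ is said to be superior to
$T_2$ if $\Delta > 0$, and inferior if $\Delta < 0$. The terminal decision
rule depends on the maximum likelihood estimate for $\Delta$
which, after n observations, is given by

    $\hat{\Delta}_n = \hat{\Delta}_n(\gamma) = \bar{X}_{n_1} - \bar{Y}_{n_2},$

where $n_1 = \gamma_1 + \ldots + \gamma_n$, $n_2 = n - n_1$, and

    $\bar{X}_{n_1} = \frac{1}{n_1} \sum_{i=1}^{n} \gamma_i X_i; \quad \bar{Y}_{n_2} = \frac{1}{n_2} \sum_{i=1}^{n} (1 - \gamma_i) Y_i.$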

2  Design  Characteristics 

With the primary goal being to select the better of two
competing therapies, the decision rule has been formulated
to test the hypothesis

(2)    $H_0 : \Delta < 0$ vs. $H_1 : \Delta > 0$,

and it specifies

       Reject $H_0$ if $\hat{\Delta}_n > 0$;
(3)    No decision if $\hat{\Delta}_n = 0$;
       Fail to reject $H_0$ if $\hat{\Delta}_n < 0$.

An informative measure of how well a test performs is
given by its power. For this problem, the power is simply
the probability, as a function of $P \in \Omega$, of correctly
identifying the superior treatment. In practice, a rule
allowing the no decision option should not be used without
a null hypothesis of equality and corresponding
acceptance region. We would prefer, in fact, a test that
not only recognizes similar treatment effects with high
probability, but also one that has maximum power at the
smallest clinically significant difference between the
parameters. The testing regions here, however, have been
established so that we may study the behavior of the
allocation rules over the entire parameter space and obtain
lower bounds for the power of (3). In [3], we examine
problems incorporating both type I and II errors.
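
The terminal decision rule (3) is simple enough to state as code. The
sketch below is illustrative only: it assumes that the estimated
difference is the observed success rate on $T_1$ minus that on $T_2$,
so that positive values favour $T_1$, and it compares the two rates by
cross-multiplication so that the "no decision" case is detected
exactly. The function names and return strings are hypothetical.

```python
def estimated_difference(successes_1, n1, successes_2, n2):
    """Observed success rate on T1 minus observed success rate on T2."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("each treatment needs at least one patient")
    return successes_1 / n1 - successes_2 / n2


def terminal_decision(successes_1, n1, successes_2, n2):
    """Three-way decision of rule (3), based on the sign of the estimate."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("each treatment needs at least one patient")
    # Sign of s1/n1 - s2/n2, computed with integers to avoid rounding error.
    sign = successes_1 * n2 - successes_2 * n1
    if sign > 0:
        return "reject H0"
    if sign < 0:
        return "fail to reject H0"
    return "no decision"


print(terminal_decision(7, 10, 5, 10))   # T1 looks better
print(terminal_decision(2, 6, 1, 3))     # equal observed rates
```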

It is not difficult to show that, for any $P \in \Omega$, the
probability of making an incorrect decision based on (3)
is minimized by allocating patients to therapies in equal
proportions. This may be achieved via alternating
assignments or by constrained or blocked randomization.
Since an equal allocation rule guarantees that fully half
of the patients are assigned to the inferior treatment,
designs utilizing such rules tend to incur more failures than
may be necessary for the decision process. Our evaluations of
allocation rules are based on three criteria:

1.  The  probability  of  making  a  ‘correct’  decision  at  the 
end  of  the  trial, 

2.  The  expected  number  of  failures  during  the  trial, 

3.  The  complexity  of  the  computations  required  to  uti¬ 
lize  the  design. 

Due  to  space  limitations,  the  manner  in  which  these  cri¬ 
teria  are  assessed  is  quite  simplistic.  While  each  of  these 
items  can  be  viewed  from  many  angles,  the  results  (Sec¬ 
tion  4)  seem  to  be  representative  of  the  behavior  of  the 
allocation  rules  in  more  general  settings  as  well. 

2.1  Bandit  Problems 

The sampling plans that we propose are based on optimal
rules for multi-armed bandit problems. In a bandit
problem,  the  goal  is  to  maximize  the  sum  of  weighted 
outcomes  arising  from  a  sequence  of  experiments  from 
arms  whose  outcomes  follow  the  laws  of  a  specified 
Bayesian  model.  A  bandit  allocation  rule  is  thus  one 
that  utilizes  prior  information  on  unknown  parameters 
together  with  incoming  data  to  determine  optimal  selec¬ 
tions  at  each  stage  of  the  experiment.  The  weighting  of 
returns  is  known  as  discounting  and  it  consists  of  multi¬ 
plying  the  payoff  of  each  observation  by  the  correspond¬ 
ing  element  of  a  discount  sequence.  The  properties  of 
any  given  bandit  allocation  rule  will  depend  upon  the 
associated  discount  sequence  and  prior  distribution. 


Here we have only a two-armed bandit (TAB), but
these techniques generalize easily to problems with
several arms. Let the outcomes for the two treatment arms
be given by (1), and model the prior information on the
success rates, $P_1, P_2$, as independent beta distributions

    $P_1 \sim Be(a_0, b_0)$ and $P_2 \sim Be(c_0, d_0)$.

At any stage $m < n$, the posteriors for $P_1$ and $P_2$ are

(4)    $(P_1 \mid k, i, j) \sim Be(a, b); \quad (P_2 \mid k, i, j) \sim Be(c, d)$

where $k = \sum_{h=1}^{m} \gamma_h$, $i = \sum_{h=1}^{m} \gamma_h X_h$,
$j = \sum_{h=1}^{m} (1 - \gamma_h) Y_h$ and

    $a = i + a_0, \quad b = k - i + b_0,$

    $c = j + c_0, \quad d = m - k - j + d_0.$

The posterior means of $P_1$ and $P_2$ at m are simply
$E_m[P_1] = a/(a + b)$ and $E_m[P_2] = c/(c + d)$, where
$E_m$ denotes expectation in the model (4).
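
The posterior update amounts to adding counts to the prior
parameters. A minimal sketch follows, assuming k is the number of the
first m patients treated on $T_1$, i the number of successes on $T_1$
and j the number of successes on $T_2$; the function names and the
example counts are hypothetical, and the default prior parameters are
the uniform priors.

```python
def beta_posteriors(m, k, i, j, a0=1, b0=1, c0=1, d0=1):
    """Posterior beta parameters (a, b) for T1 and (c, d) for T2 after m patients."""
    if not (0 <= i <= k <= m and 0 <= j <= m - k):
        raise ValueError("counts are inconsistent")
    a = i + a0            # prior successes plus successes on T1
    b = k - i + b0        # prior failures plus failures on T1
    c = j + c0            # prior successes plus successes on T2
    d = m - k - j + d0    # prior failures plus failures on T2
    return (a, b), (c, d)


def posterior_mean(alpha, beta):
    """Mean of a Be(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical state: 10 patients, 6 on T1 with 4 successes, 4 on T2 with 1 success.
(a, b), (c, d) = beta_posteriors(m=10, k=6, i=4, j=1)
print((a, b), posterior_mean(a, b))
print((c, d), posterior_mean(c, d))
```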

Typically,  the  choice  of  a  prior  distribution  will  de¬ 
pend,  somewhat  subjectively,  on  the  knowledge  of  the 
investigator  preceding  the  trial.  We  use  independent 
uniform  priors  here,  ao  =  bo  =  Co  =  do  =  1,  because 
they  contain  no  initial  bias  and  little  information,  and 
because  the  parameters  of  the  beta  posteriors  concisely 
summarize  the  relevant  study  data  to  date. 

It  is  worthwhile  to  note  that  these  allocation  rules, 
which  arise  within  a  Bayesian  framework,  are  being  eval¬ 
uated  according  to  frequentist  standards.  In  Section  4, 
the  Bayesian  design  is  seen  to  have  had  little  effect  on  the 
results  of  the  trial  from  this  viewpoint.  However,  if  de¬ 
sired,  the  design  may  be  set  up  to  impact  the  trial  and  its 
results  more  heavily,  since  investigators  can  strengthen 
and/or  bias  the  parameters  of  the  beta  distributions  to 
reflect  a  preferred  level  of  information. 

2.2  Ethical  Criteria 

An advantage of using bandit problems to model clinical
trials is that elements of the discount sequence can be
selected to represent an ethical decision regarding the
relative importance of the patient outcomes both during
the trial and in the future. At each stage of the
sequential decision process, a bandit allocation rule is
a function both of the effort to gather information and
of the effort to gain immediate reward. Here, we consider
two discount sequences: the n-horizon uniform sequence,
$\{\beta_1, \beta_2, \ldots, \beta_n\}$ with $\beta_i = 1$, $i = 1, \ldots, n$, and
the geometric sequence, $\{1, \beta, \beta^2, \ldots\}$, $0 < \beta < 1$.

In the uniform, finite horizon case, the optimal strategy
will begin by emphasizing the gathering of information,
with the result being that the first patients will be
treated rather like patients in an equal allocation trial
where one assumes throughout that the treatments offer
the same prognosis. Toward the end of the study,
with a decision imminent, the emphasis on immediate
reward is increased until, at the last stage, a completely
myopic rule is used. In the geometric case, it is assumed
that there will always be more patients, so the need
for information is never completely absent as in the last
stage of a finite horizon problem. However, as more and
more patients are treated, the need to sacrifice immediate
reward to gain information will decrease. Since
the sample size in the present problem is fixed at n, we
truncate the allocations after n observations. Thus bandit
allocation strategies for problems with geometric
discounting are not exactly optimal for the truncated case.
As we see, however, these rules still provide good model
strategies for the problem at hand. See Hardwick [2]
for further discussion of the incorporation of geometric
bandit strategies in clinical trial designs.


2.3 Computational Criteria

Ethical attributes aside, an experimental design must be
straightforward to carry out if it is to be useful. For
computational purposes, this means that the rules should
use reasonable amounts of time and space (memory),
and be sufficiently easy to program. We distinguish here
between the computational requirements to set design
parameters and those needed to carry out the trial. In
general the former will be significantly greater than the
latter, but can be carried out on large computers without
significant deadline pressure. The latter may require
timely response, and may often be performed on personal
computers. The requirements for carrying out the trial are
analyzed in the next section, while those for setting design
parameters are discussed in [3].

3 Allocation Rules

The following three allocation rules were evaluated with
respect to the given criteria:

TAA = Truncated Alternating Allocation,

UB = Uniform Bandit, and

TGLB = Truncated Gittins Lower Bound.

The “truncation” in TAA and TGLB refers to a rule
whereby, if a state is reached such that the final decision
cannot be influenced by any further outcomes, then the
treatment with the best success rate will be used for all
further patients.

3.1 Uniform Bandit

By definition, the n-horizon uniform TAB uses prior and
accumulated information to minimize the number of
failures during the trial. We can determine the optimal
strategy for this bandit problem using dynamic programming.
Let $\mathcal{F}_m(i,j,k,l)$ denote the minimal possible
expected number of failures remaining in the trial, if m
patients have already been treated and there were i
successes and j failures on $T_1$, and k successes and l failures
on $T_2$. (Note that one parameter can be eliminated since
m = i + j + k + l.) The algorithmic approach is based on
the observation that if $T_1$ were used on the next patient,
then the expected number of failures for patients m + 1
through n would be

    $\mathcal{F}_m^1(i,j,k,l) = E_m[P_1] \cdot \mathcal{F}_{m+1}(i+1,j,k,l) + E_m[1 - P_1] \cdot (1 + \mathcal{F}_{m+1}(i,j+1,k,l))$

while if $T_2$ were used then we would get

    $\mathcal{F}_m^2(i,j,k,l) = E_m[P_2] \cdot \mathcal{F}_{m+1}(i,j,k+1,l) + E_m[1 - P_2] \cdot (1 + \mathcal{F}_{m+1}(i,j,k,l+1)).$

Therefore $\mathcal{F}$ satisfies the recurrence

    $\mathcal{F}_m(i,j,k,l) = \min\{\mathcal{F}_m^1(i,j,k,l), \; \mathcal{F}_m^2(i,j,k,l)\},$

which can be solved by dynamic programming, starting
with patient n and proceeding toward the first patient.

For the m-th patient there are $O(m^3)$ possible values
of (i, j, k, l), so to evaluate all possible combinations of m,
i, j, k, and l requires $O(n^4)$ computations. A clever
implementation might not evaluate all possible values, but
a straightforward implementation, as used here, needs
to do so, and empirical evidence indicates that, in fact,
$O(n^4)$ values must be computed. The space requirements
can be kept at $O(n^3)$ (see [3]).
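
The dynamic program can be sketched in a few lines. The code below is
a minimal illustration, not the authors' implementation. It assumes
the boundary condition that no failures remain once all n patients
have been treated, uses posterior means implied by independent beta
priors with (i, j) and (k, l) counting successes and failures on the
two treatments, and relies on memoised recursion rather than an
explicit backward sweep. The function name and its return values are
hypothetical.

```python
from functools import lru_cache


def uniform_bandit(n, a0=1, b0=1, c0=1, d0=1):
    """Return (minimal expected failures over n patients, states evaluated)."""
    evaluations = 0

    @lru_cache(maxsize=None)
    def remaining(i, j, k, l):
        nonlocal evaluations
        if i + j + k + l == n:          # all patients treated: nothing remains
            return 0.0
        evaluations += 1
        p1 = (i + a0) / (i + j + a0 + b0)   # posterior mean success rate, T1
        p2 = (k + c0) / (k + l + c0 + d0)   # posterior mean success rate, T2
        use_t1 = (p1 * remaining(i + 1, j, k, l)
                  + (1.0 - p1) * (1.0 + remaining(i, j + 1, k, l)))
        use_t2 = (p2 * remaining(i, j, k + 1, l)
                  + (1.0 - p2) * (1.0 + remaining(i, j, k, l + 1)))
        return min(use_t1, use_t2)

    value = remaining(0, 0, 0, 0)
    return value, evaluations


value, states = uniform_bandit(20)
print(value, states)
```

With uniform priors and n = 20 the sketch evaluates 8,855 states, the
number of non-negative (i, j, k, l) with i + j + k + l < n, which
agrees with the UB entry for n = 20 in Table 2. Memoising every state
uses more space than the layer-by-layer sweep described in the text,
so this form is only practical for small n.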


3.2  Gittins  Lower  Bound 

According to a theorem of Gittins and Jones [1], for bandit
problems with geometric discount and independent
arms, for each arm there exists an index with the property
that, at any given stage, it is optimal to select, at
the next stage, the arm with the higher index. The index
for an arm, the Gittins Index, is a function only of the
posterior distribution and the discount factor $\beta$. While
the existence of the Gittins Index removes many
computational difficulties associated with other bandit
problems, the only known technique for computing the
index involves an iterated dynamic programming approach
which is computationally intensive when $\beta$ is close to 1
(see [1]). Unfortunately, these are the $\beta$ values needed
to produce tests of suitable power.

Here we show that very good results can be achieved
by utilizing an easily computed approximation. For an
arm with posterior distribution $Be(a, b)$, a lower bound
for the Gittins Index is given by (see [1,2])


r(«-n)  /iV  ,31  rca+O 
(^+^+I)  =  r(a+»+i+l 


r(«) 

r(a+/') 


Qi  Ho  +  ’-U 
^  r(fl+6+.i 
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Parameters 

A  =  0.1 

A  =  0.3 

1 

Criteria 

1  TAA 

TGLB 

UB 

TAA 

TGLB 

UB 

n=20 

Power 

0  =  .999 

Average  Failures 

9.947 

9.774 

9.768 

■aEiiKM 

8.217 

n=50 

Power 

0.760 

0.754 

0.708 

0.985 

0.982 

0.947 

0  =  .9999 

Average  Failures 

24.828 

24.148 

24.117 

23.489 

19.673 

19.214 

n=100 

Power 

0.841 

0.835 

0.771 

0.999 

0.996 

0.980 

0  =  .99999 

Average  Failures 

49.614 

47.779 

47.642 

46.762 

38.051 

36.984 

Power 

0.885 

0.811 

■■mM 

0.998 

0.989 

0  =  .999999 

Average  Failures 

74.393 

71.243 

70.890 

56.367 

54.611 

Table  1:  Comparisons  of  Discrimination  and  Ethical  Criteria 


Because $\Lambda_r$ is a unimodal function of r, the best such
lower bound is $\Lambda_{r^*}$, where $r^* = \min\{r : \Lambda_r - \Lambda_{r+1} > 0\}$.
Each $\Lambda_r$ can be computed from the previous one in a
constant number of steps, so the total time to compute
the best lower bound is proportional to $r^* + 1$.

The computational requirements of the TGLB approach
are difficult to analyze since they depend upon
the value of $r^*$ and upon the successes and failures
encountered. In the simplest implementation, the
approximate indices for both treatments are computed at each
stage and compared to determine the best choice.
However, computation can be saved by noting that a “play
the winner” property holds, in that if the indices resulted
in treatment i being chosen for the previous patient, and
the outcome was a success, then they will again choose
treatment i. Therefore an index needs to be computed
only when a failure has occurred, and then only for the
treatment that failed since the posterior distribution of
the other treatment is unchanged.
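
The search for the best lower bound uses only the unimodality of the
sequence of bounds. The sketch below shows that search in isolation,
under stated assumptions: the bound itself is passed in as a callable,
the search starts at r = 1, and each bound is recomputed from scratch
rather than updated from the previous one. The toy sequence used in
the example is invented for illustration and is not the bound of the
paper.

```python
from typing import Callable, Tuple


def best_lower_bound(bound: Callable[[int], float],
                     start: int = 1,
                     r_max: int = 100_000) -> Tuple[int, float, int]:
    """Return (r_star, bound(r_star), number of bound evaluations).

    r_star is the smallest r >= start with bound(r) - bound(r + 1) > 0,
    which is the maximiser when the sequence is unimodal.
    """
    current = bound(start)
    evaluations = 1
    for r in range(start, r_max):
        following = bound(r + 1)
        evaluations += 1
        if current - following > 0:
            return r, current, evaluations
        current = following
    raise ValueError("no decrease found before r_max")


def toy_bound(r: int) -> int:
    """Hypothetical unimodal sequence peaking at r = 5."""
    return -(r - 5) ** 2


r_star, value, evaluations = best_lower_bound(toy_bound)
print(r_star, value, evaluations)
```

In this toy case the search stops at r = 5 after six evaluations,
i.e. one more than r*, in line with the cost stated above.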


4 Results

The results of our investigations are summarized in
Tables 1 and 2. The computational techniques used are
explained in [3].

Table 1 shows that TAA, which is optimized to make
the correct selection, incurs a large ethical cost, while
UB, which is optimized to minimize failures, has a poor
discrimination ability. The TGLB rule is a compromise
with nearly the power of TAA and nearly the ethical
behavior of UB. Note that TGLB has an extra parameter,
$\beta$, which must be adjusted to optimize its performance.
One can show that $\beta$ must converge to 1 as n increases
in order to obtain increasing power. The specific values
of $\beta$ used have been indicated.

Table 2 compares UB and TGLB on computational
grounds. TAA was not included since the total
computation time is merely proportional to n, i.e., $O(n)$. For
UB, the value presented is the number of evaluations of
the recurrence of Section 3.1 which occur, each of which
takes a constant amount of time. Thus the computational
time for a clinician to utilize UB is proportional to the
value presented and may be prohibitive. For TGLB, the
value also represents a quantity which is proportional to
the total computational time needed to utilize TGLB
during a trial. The value presented is the average, over
all trials, of the total number of $\Lambda_r$ values which must
be computed for index calculations throughout the trial.
While space requirements were not tabulated, recall that
UB needs $O(n^3)$ space and TGLB needs only $O(1)$ space.


    Parameters                              TGLB
     n       β               UB       Δ = 0.1   Δ = 0.3

     20    0.999            8,855        180       174
     50    0.9999         292,825        611       597
    100    0.99999      4,421,275      1,705     1,687
    150    0.999999    21,947,850      4,124     4,109

Table 2: Comparisons of Computational Time


Abstract

The convergence of Geographical Information Systems
(GIS) and Statistical Dynamic Graphics has led to the
development of a number of new concepts in spatial data
analysis. This paper discusses how such ideas may be
extended in the context of spatial modelling. The software
environment we use - REGARD - incorporates the GIS ideas
of point, line and region 'layers of data' pertaining to an area
under study.

The spatial variation here is modelled by a variogram. The
model may be used to generate new statistical views of the
data, which may be regarded as diagnostics. Aspects of the
model may be decomposed spatially in the linked-views
environment. Many such aspects are naturally viewed as
being defined point-wise. Aspects which refer to data
pair-wise can naturally be associated with line-objects. Many
analyses are oriented to defining regions of anomalous
behaviour. The paper will illustrate these inter-twined ideas.

1 Introduction

The linked windows concept in dynamic graphics supports a
number of views of the data. Each view focuses on some
relatively simple aspect of the data; dynamic linking of these
provides a platform for understanding the variation. We
introduce below a new generalisation of such views for
spatial data and exploit it here for the very specific purpose
of studying diagnostics for spatial models.

Spatial data may best be thought of as data which are defined
on objects which have location. A central view in spatial
data analysis is therefore that of the objects and of their
physical locations; we call this a Map View. The simplest
objects are point objects. Stream sediment data are naturally
thought of as multivariate data on geochemical composition
associated with small samples which may be thought of as
points. More generally one may think of data on lines or on
regions. Regional data may be administrative or geological.
Disease rates are naturally defined on regions. Lines may be
roads or streams; the variables may be stream width or traffic
on the road. Many projects, in the environmental sciences in
particular, in fact involve integrating, within one study, data
of various types associated with different types of objects.
Mineral exploration thus naturally involves assembling data
on satellite imagery (pixels - regions or points), geology
(regions), geochemistry (points) and geological faults (lines).
Many other examples abound.

One  way  to  approach  the  task  of  integrating  is  through 
Geological  Information  Systems  (Burrough,I986).  This 
supports  layers  of  information  which  may  be  superimposed 
both  visually  and  logically.  A  complementary  proposal, 
using  linked  views,  is  to  treat  each  layer  as  a  data  matrix 
comprising  a  list  of  objects  with  associated  attributes 
(variables)  and  to  support  a  Map  View  with  visually 
superimposed  point  line  and  region  objects,  each  of  which  is 
separately  associated  with  linked  views  of  the  attributes.  See 
Figure  1  and  Haslett  and  Cameron  (1990)  for  further  aspects 
of  this  study.  At  the  simplest  level,  selecting  a  case  in,  for 
example,  a  scatterplot  of  two  variables  causes  that  case  to  be 
highlighted  in  other  views  of  that  case,  including  its  object 
in  the  Map  View.  At  a  more  sophisticated  level,  selecting 
objects  in  one  layer  can  cause  objects  in  other  layers  to 
become  selected.  We  refer  to  this  as  cross-layer  linking. 
Thus  a  histogram  of  the  attributes  in  a  region  layer  can  be 
used  to  select  one  or  more  regions  of  interest;  these  regions 
can  then  cause  points  within  them,  but  in  a  point  layer,  to 
become  selected;  these  in  turn  will  be  highlighted  in  a  view 
of  the  attributes  of  the  points.  Such  ideas  have  been 
implemented  in  REGARD,  experimental  software  under 
development  in  Trinity  College,  Dublin.  See  Haslett  et 
al  (1990),  Haslett  et  al  (1991).  One  important  type  of  data 
which  lends  itself  well  to  such  analysis  is  data  defined  on  a 
network.  Such  data  are  commonplace  (traffic  flow  on  a 
telephone  network,  airline  traffic  between  cities,  trade  flows 
between counties). In this context we may use interconnected
point and line layers to analyse the data defined on
such  a  network;  lines  connect  pairs  of  nodes. 
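
Cross-layer linking from a region layer to a point layer can be sketched in a few lines. The following is a minimal illustration only, not the REGARD implementation: the function names, the representation of regions as vertex lists and the ray-casting containment test are all assumptions made for the sketch.

```python
from typing import Sequence, Set, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test; points exactly on the boundary are not treated specially."""
    x, y = p
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does this edge straddle the horizontal line through p?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def link_regions_to_points(selected_regions, regions, points) -> Set[int]:
    """Indices of the points lying inside any selected region (hypothetical helper)."""
    selected = set()
    for r in selected_regions:
        for i, p in enumerate(points):
            if point_in_polygon(p, regions[r]):
                selected.add(i)
    return selected
```

Selecting region 0 in the region layer would then select, in the point layer, exactly those points returned by `link_regions_to_points({0}, regions, points)`; the views of the point attributes can highlight the same indices.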

Thus  views  of  the  different  objects  and  of  their  attributes  can 
be linked to each other. This provides one avenue towards
the  integration  of  such  data.  An  alternative  use  of  such  a 
platform  is  in  the  spatial  decomposition  of  aspects  of  the 
data. We illustrate this by reference to a spatial model of
univariate data defined on points. We will see that a network
can be a useful vehicle to study together the pairwise
interactions and the point-wise data.


1 The support of Apple and of EOLAS (Dublin) and of
CSIRO (Sydney) is gratefully acknowledged.
REGARD  is  experimental  software,  written  by  Graham 
Wills  of  the  Department  of  Statistics,  TCD. 


426  J.Haslett  and  R.  Bradley 


2  Spatial  Decomposition 

We  begin  by  decomposing  the  mean  and  variance  of  data  on 
the  proportion  of  stone  to  be  found  at  various  point 
locations  in  a  field  (Burgess  and  Webster,  1980). 

2.1 Point-wise decomposition of the mean and pair-wise
decomposition of the variance.

Consider data $z(x_i)$ defined at points $x_i$, $i = 1, \ldots, n$. Trivially,
the mean of these is

$$\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z(x_i)$$

Thus the histogram view of $z(x_i)$ provides the possibility of
a spatial decomposition of the mean $\bar{z}$ by cross referring to
the physical locations in the Map View. See Figure 2. As
the smaller values are to be found in one part of physical
space (and the larger values in another) there is certainly
spatial structure. The mean is a very useful summary of
unimodal bell-shaped distributions; in other cases this is less
clear. In Figure 2, for example, there is a suggestion that
there are in fact two modes. This is reinforced by the Map
View.

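Less trivially, the variance may be written as a sum over pairs
of points. Taking the divisor $n - 1$,

$$s^2 = \frac{1}{n(n-1)} \sum_{i>j} \left( z(x_i) - z(x_j) \right)^2$$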

Thus a histogram view of $(z(x_i) - z(x_j))^2$ provides the
possibility of a spatial decomposition of the variance by
cross referring to the physical locations in the Map View.
Since $(z(x_i) - z(x_j))^2$ involves a pair of points, it is natural
to associate it with a line from $x_i$ to $x_j$. Further, it is more
natural to view $(z(x_i) - z(x_j))^2$ through a scatterplot against
$h_{ij} = |x_i - x_j|$. See Figure 3. This is in fact a plot of the
'variogram cloud' (Chauvet, 1982) and is closely related to
the empirical variogram of the geostatistical family of
models of spatial variation. In Figure 3 a selection has been
made of the pairs which have relatively high $(z(x_i) - z(x_j))^2$
for the separation. The corresponding lines are shown in the
Map View; all other lines (of $n(n-1)/2$ in total) remain
invisible. The variance (and more specifically the empirical
variogram) has been spatially decomposed pairwise. Certain
pairs, in a band in the SE and associated with a single point
in the NW, contribute strongly. See Bradley and Haslett
(1990) for further discussion.
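
The variogram cloud is simple to compute directly. The sketch below is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, points are taken to be planar coordinates, and the variance is taken with divisor $n - 1$, for which the pairwise sum must be divided by $n(n-1)$.

```python
from itertools import combinations
from math import hypot


def variogram_cloud(coords, z):
    """Return (i, j, h_ij, squared difference) for every pair of points."""
    cloud = []
    for i, j in combinations(range(len(z)), 2):
        h = hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
        cloud.append((i, j, h, (z[i] - z[j]) ** 2))
    return cloud


def sample_variance(z):
    """Usual point-wise form of the variance, divisor n - 1."""
    n = len(z)
    mean = sum(z) / n
    return sum((v - mean) ** 2 for v in z) / (n - 1)


def variance_from_pairs(cloud, n):
    """The same variance rebuilt from the pairwise squared differences."""
    return sum(sq for _, _, _, sq in cloud) / (n * (n - 1))
```

Each tuple in the cloud corresponds to one line in the line layer: plotting its fourth element against its third gives the scatterplot of Figure 3, and the two variance functions agree up to rounding, which is the pairwise decomposition in numerical form.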

It  is  clear  that  there  is  a  local  outlier  in  the  NW  and  some 
suggestion  of  a  discontinuity  in  the  SW.  Below  we  see  that 
by  viewing  the  variation  through  a  simple  model  these 
issues  become  more  clearly  defined. 


2.2  Pair-wise  decomposition  of  the  likelihood. 

If we model the data as being a partial realisation of a spatial
stochastic process we may develop model-derived measures
of the data which may similarly be decomposed. Specifically,
consider the geostatistical model in which the data above are
taken to be a partial realisation of a Gaussian stochastic
process with given isotropic variogram $\gamma(h)$ (Journel and
Huijbregts, 1978) in which

$$E\left\{ \tfrac{1}{2} \left( Z(x_i) - Z(x_j) \right)^2 \right\} = \gamma(|x_i - x_j|) = \gamma(h_{ij})$$

Clearly, under the model, the data are a single realisation of
a multivariate Normal distribution. Consequently the
log-likelihood of the data can be written, up to a constant,
as a quadratic form in the $z(x_i)$. A convenient representation
for pair-wise decomposition is:

$$-2\,\mathrm{LogLikelihood} = \mathrm{constant} + \sum_{i>j} \omega_{ij} \left( z(x_i) - z(x_j) \right)^2$$

where the $\omega_{ij}$ terms have an interpretation close to the
partial covariance of $(Z(x_i), Z(x_j))$ given the rest of the data.
More specifically they can be seen by considering leave-one-out
cross-validation, the estimation of each data point in
turn from all the others. Thus, by seeking the maximum
likelihood estimator of the unknown $Z(x_i)$ from
$\{z(x_j),\ j \neq i\}$ we find on differentiation that

$$\hat{Z}(x_i) = \sum_{j \neq i} \omega_{ij} z(x_j) \Big/ \sum_{j \neq i} \omega_{ij} = \sum_{j \neq i} \lambda_{ij} z(x_j)$$

and that the variance of this estimator is $\sigma^2(x_i) = 1 / \sum_{j \neq i} \omega_{ij}$.

The $\omega_{ij}$ are thus proportional to the kriging weights $\lambda_{ij}$
used in cross-validation. They may also be interpreted as
proportional to the correlations between cross-validation
residuals at $x_i$ and $x_j$. In this context they have another
interpretation, that of pair-wise leverage on the likelihood.
More simply stated, for a pair of data values to contribute
significantly to the (un)likelihood, which is in fact a
measure of lack-of-fit, it is necessary not only that
$(z(x_i) - z(x_j))^2$ be large, but also that $\omega_{ij}$ be large.


This suggests a scatterplot of $(z(x_i) - z(x_j))^2$
against $\omega_{ij}$. Pairs which are high in both are important; if
they are also located in one part of space, that part of space is
contributing significantly to the lack of fit. Thus for lack of
fit, $\omega_{ij}$ is a more useful metric than $h_{ij}$ against which to
judge the separation of two points. In Figure 4, such a plot
is presented; a few pairs have been selected and are presented
as lines in the Map View. These have been computed using
a variogram fitted to the data. They are seen to communicate
the same message as previously, but much more crisply.
Note the cross-layer linking: selected lines in the line layer
cause their end points to be selected in the point layer. We
see, naturally, that these pairs of points are associated with
the upper and lower ends of the distribution of the
proportion of stone. Point-wise decomposition of the
likelihood is also possible and attractive. In particular, it is
possible to show that:

$$-2\,\mathrm{LogLikelihood} = \mathrm{constant} + \sum_{i} z(x_i) \left\{ z(x_i) - \hat{Z}(x_i) \right\} / \sigma^2(x_i)$$

A scatterplot of $\{z(x_i) - \hat{Z}(x_i)\}/\sigma(x_i)$, the standardised
cross-validation residual, vs $z(x_i)/\sigma(x_i)$ can therefore provide the
basis for a point-wise decomposition of the likelihood. See
Bradley and Haslett (1991).
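
The agreement between the pair-wise and point-wise forms can be checked numerically. The sketch below is only an illustration: the weights are arbitrary symmetric numbers supplied by the caller, not values derived from a fitted variogram, the function names are hypothetical, and it assumes that the cross-validation variance is the reciprocal of the summed weights, which is what differentiating the pair-wise form gives.

```python
def pairwise_terms(omega, z):
    """Contribution of each pair (i, j), i > j, to the quadratic form."""
    n = len(z)
    return {(i, j): omega[i][j] * (z[i] - z[j]) ** 2
            for i in range(n) for j in range(i)}


def cross_validation(omega, z):
    """Leave-one-out estimate and variance at each point, from the weights."""
    n = len(z)
    estimates, variances = [], []
    for i in range(n):
        total = sum(omega[i][j] for j in range(n) if j != i)
        weighted = sum(omega[i][j] * z[j] for j in range(n) if j != i)
        estimates.append(weighted / total)
        variances.append(1.0 / total)
    return estimates, variances


def pointwise_terms(omega, z):
    """Contribution of each point: z_i * (z_i - estimate_i) / variance_i."""
    est, var = cross_validation(omega, z)
    return [z[i] * (z[i] - est[i]) / var[i] for i in range(len(z))]
```

Summing either set of terms gives the same total, so the same lack of fit can be attributed to lines (pairs) or to points, whichever view is more convenient. Note that individual point-wise terms may be negative even though the total is not.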

3  Discussion 

This  paper  has  indicated  one  of  a  number  of  new 
possibilities  for  the  use  of  linked  windows.  Here  we  have 
concentrated  on  its  use  as  a  platform  for  research  on 
diagnostics  in  spatial  modelling.  New  pairwise  views  of  the 
data  can  be  supported;  it  is  thus  possible  to  investigate  a 
number  of  pairwise  diagnostics.  Two  such  have  been  offered; 
the decomposition of the empirical variogram may
perhaps be described as a pre-modelling diagnostic, and the
decomposition of the likelihood as a post-modelling
diagnostic. Spatial decomposition is a general principle: one
can often ask "from where does the evidence come that
...?". The likelihood can be expressed as a sum of terms
defined pair-wise, as above, or point-wise; see Bradley and
Haslett  (1991).  Other  pairwise  and  point-wise  diagnostics 
can  be  created  and,  using  our  platform,  can  be  investigated. 
Such  procedures  are  likely  to  be  particularly  valuable  in 
multivariate  spatial  processes. 

The  possibilities  of  integrating  data  of  different  spatial 
support are also important. In this context the ideas of
cross-layer linking provide a natural vehicle with which to
examine  such  data.  See  Haslett  et  al  (1991)  for  further 
examples. 
Figure 1: Layers of data. Shown are three layers of data, defined on points, regions and lines. The data pertain to
rainfall  data  in  30  meteorological  regions  of  New  South  Wales.  Variables  of  interest  are  defined  on  the  regions, 
on  points  within  the  regions  and  on  pairs  of  points.  The  objects  for  each  layer  are  visible  in  the  Map  View: 
statistical  views  of  some  of  their  attributes  are  shown.  In  each  layer  a  few  objects  have  been  selected.  Here 
there  is  no  formal  cross-layer  linking;  since  the  objects  occupy  the  same  physical  space  they  are  visually  cross 
referred.  The  use  of  colour  in  the  Map  View  is  recommended. 


Figure 2: Data on the distribution of the proportion of stone in soil samples in a field. Selecting the left hand tail of the
distribution shows that there is a clear spatial structure to the variation. One unusual point has been identified.


Dynamic  Graphics-Spatial  Models  429 


Figure 3: With each pair (line) is associated the distance between the pair and the squared difference
between the two proportions. A scatter plot of these is shown. Some of these have been selected.
These correspond to pairs which are unusually different, given their separation. The
corresponding lines have been highlighted. The remainder of the (very many) lines remain
invisible. It is clear that many such pairs are to be found in a region in the SW and associated with
a single point in the NW.


Figure 4: A few pairs simultaneously having high $\omega_{ij}$ and large squared difference have been selected.
These correspond to the same features as in Figure 3, but much more clearly defined. In this case
cross-layer linking is active: thus the points at the ends of the selected lines have been selected and
the point data values are shown in the histogram view of the data in the point layer.


D. F. Swayne, A. Buja, and N. Hubbell


AD-P007  183 


XGobi  Meets  S:  Integrating  Software  for  Data  Analysis 

Deborah F. Swayne, Bellcore, dfs@bellcore.com
Andreas Buja, Bellcore, andreas@bellcore.com
Nancy Hubbell, University of Wisconsin and Bellcore


ABSTRACT 

This paper describes an approach to integrating various
computing tools used in data analysis. Integration
is accomplished by creating direct manipulation
panels which control and link disparate software. The
linked programs can perform data manipulation, numerical
analysis, static or dynamic graphics.

The two prototypes described here are integrated systems
that are used to control data analysis sessions.
XSmooth coordinates a smoothing session; it consists of
a control panel and a plotting window and has a link to
S. XClust coordinates a clustering session; it uses a panel
to control an S process and one or more instances of
XGobi, an interactive dynamic graphics program. The
prototypes run in the X Window System™.

1  Introduction 

There is a great deal of software for statistical computing
available for UNIX® workstations, but no single
system can do everything. A system which is rich
in data manipulation functionality may lack dynamic
graphics; a dynamic graphics package may not be easily
programmed by a user; a system which is easily
programmable may lack data analytic methods.

An analyst might then wish to use a variety of software
on a single problem. This can prove difficult and
cumbersome, because each system has its own command
syntax and its own data representation.

One  solution  is  to  use  a  control  program  to  manage 
the  communication  between  these  different  elements.  Its 
own  command  syntax  should  be  quite  simple,  so  that  it 
does  not  burden  the  user  with  another  language  to  learn. 

We are exploring a method in which we create a control
panel which becomes the user’s means of interacting
with several other pieces of software. The panel itself
has a direct manipulation interface which allows the user
to communicate with the panel by selecting buttons or
menu items with the mouse.

X  Window  System  is  a  trademark  of  MIT. 

UNIX  is  a  registered  trademark  of  UNIX  System  Laboratories. 


Figure 1: General Model

Each control panel manages a single application which
includes some or all of the following: analytical and data
manipulation software, static graphics displays, dynamic
graphics software. The analytical software can be written
by the user or it can be an independent interactive
program. The static graphics software can be independent
routines or part of a package. The dynamic graphics
software is to be XGobi. Figure 1 shows a sketch of
this approach.

The panel can control these elements in a few different
ways. In the simplest case, it can use direct function
calls, and even write directly into data structures. In
other cases, it would use UNIX interprocess communication
methods, sending and receiving data over pipes
or sockets.

Elements in this model can sometimes communicate
directly, without using the control panel as a translator.
For example, instances of XGobi can share data using an
X interprocess communication method; in fact, that is
how linked brushing and identification are implemented
in XGobi. Another sort of communication between elements
occurs if the analytical software is S and the static
graphics window is an S plotting window. In that case,
the control panel sends plotting commands to S, and
relies on S to communicate with the S plotting window.

The use of a direct manipulation interface to coordinate
and link disparate software could be applied to
several areas of statistical computing. Two examples
are optimization problems and iterative cycles of data
analysis.

In an optimization problem, the control panel could
continuously print the value of the function to be optimized,
using plots to enrich the feedback to a user. The
user could interactively adjust parameters in response to
this information. For example, a display could indicate
that the routine is stuck at a local maximum, and the
user could increase a step size, allowing the program to
keep searching the solution space. The projection pursuit
methodology developed by Cook et al. (1991) for
XGobi provides an illustration of this approach.

During an iterative cycle of data analysis, an analyst
executes a command that applies the initial model, then
studies some numerical and graphical output to evaluate
the model. After examining this output, the analyst
adjusts the model and re-executes the command. In regression,
for example, the analyst evaluates the model
using statistics such as the residual sum of squares and
the t-test for each coefficient, and uses graphical output
such as residual and influence plots. In response to
this evaluation, the analyst adjusts the model by adding
or removing a term, eliminating outliers, and so forth.
The regression may be repeated many times before the
analyst is finally satisfied with the model.

In such a data analysis session, a user wants several
kinds of information readily available at the same time:
all the printed values returned by the regression function,
various diagnostic plots, and plots of the raw data.
The analyst’s work can be made easier by a direct manipulation
interface: the recomputation of the model is
reduced to a couple of operations, key presses or clicks
of a mouse button.

2  XSmooth 

Smoothing was chosen for the initial prototype, for
two reasons: first, a scatter plot with an added line or
lines is the only graphical output required, and second,
a smoothing session is well captured by the iterative
formulation just described. The data analyst repeatedly
adjusts one or more smoothing parameters and looks at
plots of the smoothed curve and scatter plots of the raw
data.
XSmooth is an integrated system with only two elements,
as shown in Figure 2. There is a window containing
both a control panel and a plotting region, and
this window communicates with S, which performs the
smoothing computations.

Three smoothing functions can be found in S (lowess,
smooth, and spline), and each has at least one smoothing
parameter. An S user who wants to find a smoothed
curve for a pair of vectors x, y is likely to experiment
with at least one of these routines several times. First,
the scatter plot is generated:

plot(x,  y) 

Then a smoothed line can be added to the plot using
lowess, where the argument f is the fraction of the data
used for smoothing at each x point:

lines(lowess(x, y, f=.3))

At  this  point  the  user  may  generate  several  different 
smoothed  lines,  then  decide  the  plot  has  become  too 
noisy,  regenerate  the  scatter  plot,  and  continue  to  add 
smoothed  lines  until  the  preferred  value  of  f  is  found. 
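
To make the role of f concrete, here is a deliberately simplified local linear smoother. It is a sketch, not the S routine: it uses tricube weights over the nearest fraction f of the data at each x point but omits the robustness iterations that lowess applies, and the function names and the default value of f are assumptions of this illustration.

```python
def tricube(u: float) -> float:
    """Tricube weight: 1 at u = 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_smooth(x, y, f=2.0 / 3.0):
    """Fit a weighted straight line around each x point and evaluate it there."""
    n = len(x)
    k = min(n, max(2, round(f * n)))  # number of neighbours used at each point
    fitted = []
    for x0 in x:
        # Bandwidth: distance from x0 to its k-th nearest data point.
        h = sorted(abs(xi - x0) for xi in x)[k - 1]
        w = [tricube((xi - x0) / h) if h > 0 else float(xi == x0) for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        # Fall back to the weighted mean if the local line is not identified.
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted
```

A small f uses few neighbours and follows the data closely; a large f averages over more of the data and gives a smoother curve, which is why an analyst tends to try several values before settling on one.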

Using XSmooth, the user executes an S function, passing
it the name of the data to be smoothed:

xy  <-  cbind(x,  y) 

xsmooth(xy) 

An XSmooth window appears, initially displaying the
smoothed curve generated by using lowess with the default
value of f. A user can now choose to work with
lowess or to select the smooth or spline function. If
lowess is chosen, the f argument is adjusted using a
scrollbar, clicking on the arrows at either end of the
scrollbar for fine control. A single click on the “Send”
button causes a new smoothed curve to be generated and
plotted. A user can control various features of the plot
with single button clicks: whether or not the raw data
values are plotted, whether or not axes and axis labels
are shown. A history of the most recent three smoothed
curves is kept, and a user can include or exclude each of
the three.

To communicate with S, XSmooth uses UNIX interprocess
communication code. In response to a button
click, XSmooth assembles a command in S syntax and
sends it to S. Then it captures S’s response and copies
it into C data structures.
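
The assemble, send and capture cycle can be illustrated with a small sketch. This is not the XSmooth code, which is written in C and talks to a running S process: here a child Python interpreter that simply echoes its input stands in for S, and the names `build_lowess_command` and `round_trip` are hypothetical.

```python
import subprocess
import sys


def build_lowess_command(f: float, x: str = "x", y: str = "y") -> str:
    """Assemble the text of a smoothing call in S syntax."""
    return f"lowess({x}, {y}, f={f:g})"


# Stand-in for the S process: echoes every command it reads on standard input.
_CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('received: ' + line.strip())\n"
)


def round_trip(command: str) -> str:
    """Send one command down a pipe and capture what comes back."""
    result = subprocess.run(
        [sys.executable, "-c", _CHILD],
        input=command + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

A button click would call `round_trip(build_lowess_command(0.3))` and parse the reply. The sketch starts a new child for every command for simplicity; a panel of the kind described would instead keep one long-lived process and its pipes open for the whole session.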

3  XClust 

The second integrated system is used for clustering.
This is another example of iteration in data analysis,
but it uses a greater variety of graphical output than
smoothing. One would want to see different views of
the cluster structure, such as a dendrogram in a static
graphics window and a scatter plot in a high-dimensional
motion graphics system with brushing.

XClust, as shown in Figure 3, has a panel which controls
one or more instances of XGobi and an S process
with an S plotting window.

To perform clustering in S, a user is likely to repeat
a sequence of operations a number of times, using diagnostic
plots to guide the iterative procedure. First, a
distance matrix is calculated, using one of several distance
metrics:

d <- dist(x, metric="euc")

Then the hierarchical clustering tree is determined, using
one of the clustering methods available in S:

tree <- hclust(d, method="compact", sim)

Now, the dendrogram itself can be plotted and studied,
and there are several parameters to the plotting function:

plclust(tree, hang, unit, level, ...)

At  this  point,  other  diagnostics  can  be  performed.  For 
example,  one  can  use  the  cutree  function  to  cut  the 
tree,  specifying  either  a  height  or  a  number  of  clusters. 
The  function  returns  a  vector  of  cluster  membership: 

ind <- cutree(tree, k, h)

Using this vector, one can make pairwise scatter plots of
the variables, plotting each cluster with a different color
or glyph. One might plot the data in the space of the
discriminant coordinates, or use other diagnostic tools.
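
The distance, tree and cut sequence can be imitated outside S. The sketch below is an illustration only: it assumes Euclidean distance and takes the "compact" method to mean complete linkage, which is our reading rather than something stated above; the function names are hypothetical and only the cut by number of clusters k is shown.

```python
from math import dist  # Python 3.8 or later


def complete_linkage(points):
    """Agglomerative clustering; returns merges as (left, right, height)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = max(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


def cut_tree(n, merges, k):
    """Cluster membership vector (labels 1..k) after the first n - k merges."""
    label = list(range(n))
    for left, right, _ in merges[: n - k]:
        target = label[left[0]]
        for i in right:
            label[i] = target
    # Renumber labels as 1, 2, ... in order of first appearance.
    seen = {}
    return [seen.setdefault(lab, len(seen) + 1) for lab in label]
```

The membership vector returned by `cut_tree` plays the role of `ind` above: it is what a plotting program needs in order to draw each cluster with its own color or glyph.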


Figure  3:  XClust  Model 

XClust  is  initiated  from  the  UNIX  command  line.  The 
XClust  control  panel  appears,  an  S  process  is  started, 
and  an  S  graphics  window  appears.  Figure  4  shows  the 
XClust  control  panel  and  the  S  graphics  window,  as  well 
as  a  Variable  Selection  Menu,  which  will  be  described 
later. 

To start a clustering session with XClust, the user
types in the name of the S data to be used, selects a
button, and the functions dist, hclust, and plclust
are applied to that data and the tree is plotted in the S
graphics window. All the arguments to those functions
can be adjusted using menus, buttons, or text windows.
The user selects another button to initiate an XGobi
window using the same data.

To define a clustering scheme based on this tree, the
user can click on the S graphics window, specifying the
height at which to cut the tree. This action has two
results: a line is drawn on the S window indicating the
height of the cut, and the vector of cluster membership
is passed to XGobi, which then redraws each point using
a different color and glyph for each cluster. The result
of this action is shown in Figure 5.

To investigate the validity of this clustering scheme,
the user can select another button called “xgobidiscr().”
This action initiates a new XGobi window containing a
plot of the data in the space of the discriminant coordinates.
After examining the scatter plots and the discriminant
coordinate plots, a user may decide that one
or more variables are not contributing to the clustering
among the data. The Variable Selection Menu allows
these variables to be eliminated from the computation
of the distance matrix. If a change is made in that menu,
a new tree is calculated and plotted.


Figure 4: XClust: Control Panel, S Graphics Window, and Variable Selection Menu

An additional form of interconnection between program
elements is used in XClust: instances of XGobi
communicate with each other using an X interprocess
communication method. Selecting a button in one
XGobi window causes the color and glyph characteristics
of each point to be sent to other linked XGobi windows.
In this way, the points in the XGobi window which represents
the data plotted in the space of the discriminant
coordinates reflect the cluster identities shown in the
first window.
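
The effect of this linking can be mimicked inside a single program. The sketch below is a stand-in only: XGobi windows are separate processes exchanging data through X, whereas here the "windows" are plain objects holding per-point colors and glyphs, and the class and method names are hypothetical.

```python
class LinkedView:
    """A toy view holding one color and one glyph per data point."""

    def __init__(self, name, n_points):
        self.name = name
        self.colors = ["black"] * n_points
        self.glyphs = ["dot"] * n_points
        self.peers = []

    def link(self, other):
        """Link two views symmetrically."""
        if other not in self.peers:
            self.peers.append(other)
            other.peers.append(self)

    def brush(self, indices, color, glyph):
        """Change the appearance of some points in this view only."""
        for i in indices:
            self.colors[i] = color
            self.glyphs[i] = glyph

    def send(self):
        """Copy this view's colors and glyphs to every linked view."""
        for peer in self.peers:
            peer.colors = list(self.colors)
            peer.glyphs = list(self.glyphs)
```

As in the description above, nothing changes in the other views until `send` is called, so a user can finish painting clusters in one window before the linked windows are updated.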

4  Conclusions 

In this work, we often encountered the question of
when to write new software and when to use existing
code. For example, we wrote the code for the static
graphics window in XSmooth but used an S graphics
window in XClust. When we write our own code, we
have greater control over it. It would be easy to link
the plotting window in XSmooth to an XGobi window,
for instance, as XGobi windows are linked, and such a
window could respond to mouse events in a very flexible
way. On the other hand, when we use existing code, we
have the benefit of previous authors’ work. We saved
time by using the S code for drawing a clustering tree,
at the cost of some limitation on the user’s ability to
interact with that window.

In future work, we expect to encounter that same
question again, in choosing each element of the integrated
system. We will make the decision by balancing
those two factors: the amount of work we expect to save
by using existing code, and the amount of control we
want over the element of the system.

We think this model has wide applicability, and we
plan to work with it further. We would like to find out
whether it can be made easy to program. There are
tools that are intended to make it easy for users of the X
Window System to create windows such as these control
panels and to attach functionality to them. We want to
find out whether these tools could be used in our context.


Figure 5: XClust: XGobi and S Graphics Window
Abstract


When a data analyst meets a complex dataset, graphics
displays giving overall summaries are examined first, then
more specific displays that highlight observed features are
studied. Frequently, this involves selection of subsets, and
point-and-click methods are intuitive and effective. Sometimes
the observed features are investigated by altering
details of the analysis, and then an interactive command
interface (like S) can be more useful.

A rainfall dataset with geographic and time components is
used as an example. Graphics displays are done in a modified
version of S that permits multiple graphics windows,
and this is compared with xlispstat, xgobi, and datadesk.

1.0  Introduction 


Five years ago the S Language offered two examples of
dynamic graphics (brushing and spinning). These methods
were supported on special graphics terminals that never
achieved great popularity, but the concepts of these graphics
techniques inspired many. Since then these ideas have
been extended to more general notions of how linked
views of data, and animation, can be helpful to the data
analyst. Much of this work has been done on the Macintosh.
It is probably fair to say that these systems have not
been widely used by data analysts, because the software
systems provide too many restrictions.

We are interested in displaying multiple views on a workstation
screen to

■ select subsets interactively

■ investigate relationships by highlighting

■ set parameters interactively

We wish to explore how well these concepts of dynamic
graphics and linked windows fit into a realistic working
environment for data analysts. We have chosen an example
dataset that does not fit the mould of either brush or
spin, and is not trivially small.

2.0  The  Rainfall  Dataset 


The dataset to be examined is monthly rainfall for 70 years
(1913 - 1982) for 30 regions of the state of New South
Wales. The Great Dividing Range is parallel to, and close
to, the coast, giving high rainfalls along the coast and low
rainfalls in the west. The north coast is sub-tropical with
high summer rainfall, while in the south the mountain
range is higher (Snowy Mountains) and has a more prominent
winter/spring rainfall. This is indicated in Figure 1.

Our aim is to explore the data looking for patterns that
may be interesting. Clearly, plotting the map will be useful,
and comparing patterns for rainfall between regions
will be one feature to explore.

3.0 DataDesk


DataDesk on the Macintosh has gone a long way with the
concept of linked views. It can display scatterplots,
lineplots, barcharts, piecharts, boxplots, and probability
plots. Any number of these can be displayed with linkages
between them.

However, it did not seem that we could plot a map of the
regions, so we did not pursue DataDesk any further. Of
course, we went out of our way to identify a dataset that
did not fit the brush/spin mould.

4.0 Xgobi


Xgobi is similar. It has addressed the needs of brushing
and spinning, and done a really excellent job of it. The
user controls are well designed and operate smoothly.
Color is used effectively, and re-scaling and rotation are
beautiful to watch.

However, it does not appear that our requirement for a
map of regions fits in at all.

5.0 Lisp-Stat


5.1 Learning a New System

Learning any new system can be a hassle whether it be a
new editor, a new word-processing system, or a new data
analysis system. At the most elementary level it is the
difference between

fun(x)

and

(fun x)

That is just the beginning. How do you list a function?
What is a sensible operating environment for this system?

So given these likely problems, it came as a surprise to
discover that Lisp-Stat is fun, and a challenge. This is
largely because of the graphics that can be achieved, and is
helped by the book which guides you between learning by
doing, then absorbing new concepts. To give an example,
by page 62 you are presented with an example of a dozen
lines of code. This generates a scatterplot and a slider control
(Figure 2) which sets the parameter of a Box-Cox
power transformation. As you move the slider, the plot
shows the effect.
Figure  2 
5.2 Using Lisp-Stat to look at rainfalls

Plotting  the  map  of  regions  is  quite  straightforward.  Once 
the  data  variables  have  been  set  up  all  that  is  needed  is 

; PLOT-MAP - draws a map of NSW:

(defun plot-map ()
  (def plotmap (plot-points
                mgns-x mgns-y :title
                "Map of NSW" NIL))
  (send plotmap :x-axis nil nil 5)
  (send plotmap :y-axis nil nil 5)
  (send plotmap :clear-points)
  (send plotmap :size 380 270)
  (send plotmap :location 3 80)
  (dotimes (i 30)
    (send plotmap :add-lines
          (select reg-x i)
          (select reg-y i)))
  (send plotmap :add-points
        centres-x centres-y)
  (send plotmap :linked t))

Instead of trying to add a label to each region, Lisp-Stat
makes it very easy to have the list of names as a linked
window, so that as you point at regions, the corresponding
names highlight (Figure 3). Finally, Lisp-Stat offers a
range of statistical functions that can be used to explore
the data.


6.0  S-PLUS 


6.1  The  versions  we  used 

We  started  this  work  using  a  version  of  New  S  (June  89 
tape)  that  had  been  modified  to  permit  multiple  graphics 
windows  simultaneously.  We  then  received  a  beta  copy  of 
S-PLUS  3.0  which  has  this  same  facility.  We  did  not  use 
any other new facilities of S-PLUS 3.0 for these
demonstrations.


Figure  3 


6.2  Locating  Regions 

The  first  demonstration  has  two  windows,  one  showing  a 
histogram  of  average  total  annual  rainfalls  for  the  regions, 
and  the  other  showing  a  map  of  the  regions.  Then  by 
pointing  at  any  bar(s)  of  the  histogram,  the  corresponding 
regions  of  the  map  are  shaded  in  the  same  color  as  the  his¬ 
togram  bar  (Figure  4). 


Figure  4 


The  skeleton  of  the  S  function  to  do  this  is: 
demo1 <- function()
{
    X11()                          # open first window
    histogram(...)
    repeat {
        loc.list <- locator(...)
        if (...) {                 # have selection
            if (first) {
                X11()              # second window
                plot(..)           # map
            }
            else ...               # select 2nd
            for (...)              # over selections
                for (...)          # regions
                    polygon(..)
        }
        else break                 # no selection
    }
    ....                           # finish up
}

This  runs  fast  enough  (on  a  Sun  4),  and  the  main  short¬ 
coming  is  lack  of  visible  feedback  as  the  mouse  is  used. 
The  user  has  to  know  that  locator  will  be  expecting  input 
in  window  1.  While  one  can  add  S  code  to  provide  visual 
cues,  this  only  partly  solves  the  problem. 

6.3  Pointing  at  Regions 

The inverse operation is simple. First place a map on the
screen, and then as the user points at regions, pop up a
window showing some summary of the data for that
region. So an argument to this function is a function to
produce a summary graph for the selected region. Obvious
possibilities are the total annual rainfall for the 70 years, or
the average monthly rainfall for the 12 months. Here we
show the average monthly rainfalls (Figure 5).

6.4  Locating  Correlations 

Given  that  we  can  summarize  each  region  by  12  monthly 
averages  or  70  annual  totals,  we  can  look  at  correlations 
between  regions.  It  is  then  useful  to  be  able  to  relate  given 
correlations  to  the  map.  In  this  example,  the  correlations 
are  presented  in  a  histogram,  and  their  location  on  the  map 


Figure  5 

is shown by drawing a line between the two regions. We
have  used  the  monthly  averages,  and  then  focussed  on  the 
small  number  of  correlations  that  are  negative  (Figure  6). 
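The selection step described here can be sketched in a few lines. The example below is in Python rather than S, and the three regions, their monthly averages, and the helper names `pearson` and `negative_pairs` are all hypothetical; it only illustrates computing a correlation for every pair of regions and keeping the negative ones, which are the pairs that would be joined by a line on the map.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Sample correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def negative_pairs(monthly):
    """Return (region_a, region_b, r) for every pair with r < 0."""
    pairs = []
    for a, b in combinations(sorted(monthly), 2):
        r = pearson(monthly[a], monthly[b])
        if r < 0.0:
            pairs.append((a, b, r))
    return pairs


# Hypothetical 12-month averages (January to December), not real data.
monthly = {
    "north-east coast": [180, 200, 190, 130, 110, 90, 70, 60, 60, 90, 110, 140],
    "alpine": [80, 70, 90, 110, 140, 160, 180, 190, 170, 150, 120, 100],
    "inland": [60, 65, 55, 40, 38, 35, 30, 28, 30, 40, 45, 50],
}

for a, b, r in negative_pairs(monthly):
    print(f"{a} / {b}: r = {r:.2f}")
```

With these made-up numbers the summer-wet and winter-wet regions are the ones that pair up with negative correlation.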


Figure  6 


It  turns  out  that  all  these  negative  correlations  are  between 
the  sub-tropical  north-east  coast,  and  the  Snowy  Moun¬ 
tains.  The  next  step  is  then  to  look  at  the  monthly  patterns 
for  these  regions,  and  the  function  provides  for  this. 
The main point to be made from these examples is that the
big advantage of S is the flexibility in defining the graphics
required, and the range of analyses that can be called
on.

For a given region there are two plots of interest: the
average monthly rainfall for each of the 12 months, and the
pattern of wet/dry years, as shown by a plot of 70 total
annual rainfalls. They are both shown here. The annual
total rainfalls show a smoothed trend-line fitted using a
cubic spline routine that was dynamically loaded. It would
be tempting to provide a slider for the adjustment of the
smoothing parameter for these trend-lines (Figure 7).

7.0  Conclusions 


Having deliberately chosen a data-set that was not well
served by brushing or spinning of scatterplots, it is not
surprising that we did not make progress with less
programmable tools like DataDesk and Xgobi.


Figure 7


440  R.  Baxter  et  al. 


Lisp-Stat  provides  a  good  environment  for  implementing 
the  linked  views  required.  It  is  easy  to  program  the  graphs 
required,  and  the  responsiveness  of  the  system  is  good. 
However,  the  range  of  procedures  available  is  limited,  and 
at  the  next  stage  of  the  data  analysis  scenario,  this  may 
have  become  a  real  limitation. 

S-PLUS, on the other hand, has an extensive range of
statistical functions, and it is likely that whatever else we
would want to try could easily be done. The graphics are
also flexible, so that whatever style of display we require, it
should be achievable. The multiple graphics windows
make the multiple views possible. However, the shortcoming
is the degree of responsiveness to the mouse.

Since  S-PLUS  would  be  our  preferred  working  environ¬ 
ment,  the  ideal  solution  would  be  to  have  improved  facili¬ 
ties  for  providing  linked  displays  in  this  environment. 
An Application of Subregion Adaptive Numerical
Integration  to  a  Bayesian  Inference  Problem  * 


Alan  Genz 

School  of  EE  and  CS 
Washington  State  University 
Pullman,  WA  99164-2752 

Abstract 

Well-tested  and  available  software  for  evaluating  multi¬ 
dimensional  integrals  of  moderate  dimensionality  may  be 
adapted  for  use  in  Bayesian  inference  via  elementary  pa¬ 
rameter  transformations.  We  illustrate  with  an  example 
from  cognitive  modeling  of  error  rates  in  computer-based 
tasks,  in  which  the  parameter  being  integrated  is  six¬ 
dimensional  and  the  integrand  itself  requires  a  product 
of  twenty  one-dimensional  integrations  for  each  function 
evaluation.  This  method  appears  competitive  with,  and 
may  be  superior  to,  alternative  methods  when  the  trans¬ 
formations  are  well  chosen. 

KEY  WORDS:  adaptive  integration,  hierarchical  models, 
multiple  integrals,  posterior  computation. 

1  Introduction 

Many Bayesian analysis computations require evaluation
of multidimensional integrals in the form
$\int h(\lambda) L(\lambda) \pi(\lambda)\, d\lambda$, where $L(\lambda)$ is the
likelihood function, $\pi(\lambda)$ is the prior density, and $\lambda$ is
m-dimensional. Among integration problems generally, these
are special because the likelihood function is often peaked
near its maximum and thus locally of an approximately
normal form. Integration strategies often try to take
advantage of this special situation (e.g., Geweke, 1989;
Naylor and Smith, 1982). Here we show how subregion
adaptive numerical integration (Berntsen, Espelid, and Genz,
1991a,b) over the m-dimensional unit cube may be used
following an elementary parameter transformation that
accommodates the local form of the likelihood function. We
apply the method to computations for a hierarchical model
in which m = 6 but each evaluation of $L(\lambda)$ itself involves
a product of twenty one-dimensional integrals. Although the
results given are for only one specific example, the general
approach described here should be useful in a variety
of other applications.

*This work was supported in part by NSF Grant DMS-9008125.

2  Adaptive  Integration 

Subregion adaptive integration methods are based on the
fundamental assumption that the integrand in the problem
of interest can be accurately approximated locally by
a low degree multivariate polynomial. The basic strategy
for a subregion adaptive algorithm is to dynamically
subdivide the initial integration region $R$ into smaller and
smaller subregions that are concentrated in the parts of
$R$ where the integrand is more irregular. The hope is that
at some stage in this process the region $R$ is sufficiently
well partitioned that the combined integrated polynomial
approximations for all of the subregions provide an accurate
approximation to the initial integral. Typical input
for this type of algorithm consists of (i) a description of
the initial integration region $R$, (ii) the integrand, (iii)
an error tolerance $\epsilon$ and (iv) a limit $k_{max}$ on the total
number of subregions allowed.

The adaptive algorithm itself also requires a basic
integration rule (or formula) $B$ and an associated error
estimation rule $E$. We let $B_i$ be the approximation to the
integral in a subregion $R_i$ obtained using the basic rule,
and let $E_i$ be the estimate for the absolute error in $B_i$. If
at some stage in the algorithm $R$ has been subdivided into
$k$ subregions, the relevant pieces of information are kept
in a list $S = \{(R_1, B_1, E_1), (R_2, B_2, E_2), \ldots, (R_k, B_k, E_k)\}$.
Initially we set $R_1 = R$ with $k = 1$ and compute $B_1$ and
$E_1$. There are many possible adaptive strategies that
may be used to dynamically refine the list $S$. The software
(Berntsen, Espelid, and Genz, 1991b) that we have
used for the tests described in Section 5 uses a globally
adaptive algorithm. The main loop for this algorithm has
the general form:

while ($\sum_{i=1}^{k} E_i > \epsilon$ and $k < k_{max}$) do

a) determine $j$ with $E_j = \max_{i=1}^{k} E_i$

b) divide $R_j$ into two pieces: $R_j = R_j \cup R_{k+1}$

c) compute $(B_j, E_j)$ and $(B_{k+1}, E_{k+1})$

d) set $k = k + 1$
end while

In step (b) the selected subregion is divided in half
along the coordinate axis where the integrand is (locally)
most rapidly changing (see Berntsen, Espelid, and Genz,
1991a for details). The output from the algorithm is an
estimate $\sum_{i=1}^{k} B_i$ for the integral, and an error estimate
$\sum_{i=1}^{k} E_i$.
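The main loop above can be illustrated with a small one-dimensional stand-in. The sketch below is not the SCUHRE code: it assumes Simpson's rule as the basic rule, uses the difference between a one-panel and a two-panel Simpson estimate as the error estimate, and keeps the list of subregions in a heap so that the subregion with the largest error estimate is always split next. All function names are illustrative.

```python
import heapq
import math


def simpson(f, a, b):
    """Simpson's rule on [a, b]."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))


def rule_with_error(f, a, b):
    """Return (B_i, E_i): a two-panel Simpson estimate and a crude error estimate."""
    mid = 0.5 * (a + b)
    coarse = simpson(f, a, b)
    fine = simpson(f, a, mid) + simpson(f, mid, b)
    return fine, abs(fine - coarse)


def globally_adaptive(f, a, b, eps=1e-8, kmax=10000):
    """Globally adaptive integration of f over [a, b]."""
    est, err = rule_with_error(f, a, b)
    # Heap entries are (-E_i, left, right, B_i); negating E_i gives a max-heap.
    heap = [(-err, a, b, est)]
    k = 1
    total_err = err
    while total_err > eps and k < kmax:
        _, lo, hi, _ = heapq.heappop(heap)        # (a) subregion with largest E_j
        mid = 0.5 * (lo + hi)                     # (b) divide it in half
        for left, right in ((lo, mid), (mid, hi)):
            est, err = rule_with_error(f, left, right)   # (c) new (B, E) pairs
            heapq.heappush(heap, (-err, left, right, est))
        k += 1                                    # (d) one more subregion
        total_err = -sum(item[0] for item in heap)
    integral = sum(item[3] for item in heap)
    return integral, total_err, k


if __name__ == "__main__":
    value, error, pieces = globally_adaptive(math.sin, 0.0, math.pi)
    print(value, error, pieces)
```

Recomputing the error sum on every pass keeps the sketch short; production codes update it incrementally.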

While  software  for  this  type  of  algorithm  has 
not  been  widely  used  by  statisticians,  it  has  been  available 
for  several  years.  Software  that  uses  a  globally  adaptive 
subdivision  algorithm  was  available  in  the  NAG  library 
starting  in  1980,  and  in  CMLIB  starting  in  1985. 

3  Transformations 

In order to use a subregion adaptive algorithm we first
need to apply transformations to the integration variables
so that the region of integration becomes a hyper-rectangle.
Many Bayesian analysis problems (including
the problem that will be discussed in Section 4) have a
prior density function that is a product of commonly-
occurring one or two variable density functions, like the
normal or gamma density functions. In this case an
obvious choice for a prior transformation is simply to
use the appropriate cumulative distribution function to
transform each of the variables. For example, if one
of the integration variables $x$ has an associated factor
$e^{-(x-\mu)^2/(2\sigma^2)}$ in the prior with integration limits
$-\infty$ and $\infty$, then the change of variable $x = \mu + \sigma\Phi^{-1}(z)$
with $\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-s^2/2}\, ds$ allows the removal of the
associated exponential factor from the prior, and the $z$
variable integration limits become 0 and 1. A sequence
of transformations like this one can provide the
hyper-rectangular domain of integration needed for a subregion
adaptive algorithm.
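As a small illustration of this change of variable, the sketch below maps a normal prior factor onto the unit interval with $x = \mu + \sigma\Phi^{-1}(z)$ and integrates what is left with a plain midpoint rule. The midpoint rule, the function name, and the test integrand are assumptions made for the example; they stand in for the adaptive rule discussed in the text.

```python
from statistics import NormalDist


def expectation_via_unit_interval(g, mu, sigma, n=20000):
    """Approximate the integral of g(x) times the Normal(mu, sigma^2) density.

    After substituting x = mu + sigma * Phi^{-1}(z) the normal factor
    disappears and the integral becomes the integral of g(x(z)) over (0, 1),
    which is approximated here by an n-point midpoint rule.
    """
    inv_cdf = NormalDist().inv_cdf   # standard normal quantile function
    total = 0.0
    for i in range(n):
        z = (i + 0.5) / n            # midpoint of the i-th cell in (0, 1)
        x = mu + sigma * inv_cdf(z)
        total += g(x)
    return total / n


if __name__ == "__main__":
    # For X ~ Normal(1, 2^2), E[X^2] = mu^2 + sigma^2 = 5.
    print(expectation_via_unit_interval(lambda x: x * x, 1.0, 2.0))
```

The transformed integrand is unbounded near 0 and 1 when g grows in the tails, which is one reason an adaptive rule is preferable to the fixed grid used here.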

Although this approach might seem limited to situations
in which the prior is a product of distributions
such as normal and gamma and for which there are
available good numerical routines to evaluate the distribution
function, a simple modification of this approach would be
to base the transformations on elementary distributions
chosen in some convenient way but without necessarily
matching the marginal priors. This could require a lot
of work from the user, however. Furthermore, when the
algorithm is applied following this kind of transformation
it may lose efficiency by spending large amounts of time
finding the subregions in which the contributions to the
integral are large. For the same reason, any kind of a priori
specification of the transformation is likely to produce
an integrand that is poorly suited to the adaptive
algorithm. In order to obtain more efficient computation a
transformation should force the algorithm to concentrate
function evaluations where the integrand contributes
substantially to the integral.

The general transformation method we use here is
based on the assumption (Chen, 1985) that the posterior
density function is approximately multivariate normal. In
this case, $L(\lambda)\pi(\lambda) \approx c\, e^{-\frac{1}{2}(\lambda-\mu)^t \Sigma^{-1} (\lambda-\mu)}$ for some
constant $c$, and optimization can be used with $\log(L(\lambda)\pi(\lambda))$
to obtain the posterior mode $\mu$ and the modal covariance
matrix $\Sigma$. If $CC^t$ is the Cholesky decomposition
of $\Sigma$ then we may transform the original integral using
$\lambda = \mu + Cy$, followed by inverse normal transformations
on the individual $y$ components to obtain

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(\lambda) p(\lambda)\, d\lambda = |C| \int_0^1 \cdots \int_0^1 \mathrm{h}(z)\, g(y(z))\, dz,$$

where $g(y) = (2\pi)^{m/2} e^{y^t y/2} p(\mu + Cy)$, $\mathrm{h}(z) = h(\mu + Cy(z))$
and $y(z) = (\Phi^{-1}(z_1), \ldots, \Phi^{-1}(z_m))^t$.

This  joint  transformation  is  more  straightforward  to 
apply  than  one  that  would  consist  of  separate  transfor¬ 
mations  for  each  of  the  integration  variables.  In  addition, 
as  long  as  the  stated  assumption  of  approximate  normal¬ 
ity  is  correct,  this  method  should  be  reasonably  efficient 
because  the  transformed  integrand  should  behave  roughly 
as  a  piecewise  low-order  polynomial  comprised  of  a  com¬ 
paratively  small  number  of  pieces. 
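A two-dimensional sketch may help show why this transformation is effective. The code below assumes a mode and covariance matrix are already known (in practice they would come from an optimizer), applies $\lambda = \mu + Cy$ with $y_i = \Phi^{-1}(z_i)$, and integrates over the unit square with a midpoint rule. The test function `gaussian_kernel` and all names are illustrative; for an exactly normal kernel the transformed integrand is constant, so even a coarse grid reproduces the integral $2\pi|\Sigma|^{1/2}$.

```python
import math
from statistics import NormalDist


def cholesky2(s11, s12, s22):
    """Lower-triangular Cholesky factor of a 2x2 covariance matrix."""
    c11 = math.sqrt(s11)
    c21 = s12 / c11
    c22 = math.sqrt(s22 - c21 * c21)
    return c11, c21, c22


def modal_integral(p, mu, sigma, n=40):
    """Integrate p over the plane using the modal transformation.

    mu is the mode (2 values), sigma holds (s11, s12, s22), and the
    transformed integrand is summed on an n x n midpoint grid in (0, 1)^2.
    """
    c11, c21, c22 = cholesky2(*sigma)
    det_c = c11 * c22
    inv_cdf = NormalDist().inv_cdf
    ys = [inv_cdf((i + 0.5) / n) for i in range(n)]
    total = 0.0
    for y1 in ys:
        for y2 in ys:
            lam1 = mu[0] + c11 * y1
            lam2 = mu[1] + c21 * y1 + c22 * y2
            # g(y) = (2 pi)^{m/2} exp(y'y / 2) p(mu + C y), with m = 2
            total += 2.0 * math.pi * math.exp(0.5 * (y1 * y1 + y2 * y2)) * p(lam1, lam2)
    return det_c * total / (n * n)


def gaussian_kernel(lam1, lam2, mu=(1.0, -2.0), sigma=(2.0, 0.6, 1.0)):
    """Unnormalized bivariate normal kernel used as a test integrand."""
    s11, s12, s22 = sigma
    d1, d2 = lam1 - mu[0], lam2 - mu[1]
    det = s11 * s22 - s12 * s12
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * quad)


if __name__ == "__main__":
    print(modal_integral(gaussian_kernel, (1.0, -2.0), (2.0, 0.6, 1.0)))
    print(2.0 * math.pi * math.sqrt(2.0 * 1.0 - 0.6 * 0.6))
```

When the posterior is only approximately normal the transformed integrand is no longer constant, but it should remain smooth enough for a low-degree rule on a few subregions.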


4  An  Example 

An article by Carlin, Kass, Lerch and Huguenard (1990)
considers two cognitive models for predicting error rates
in computer-based tasks using the cognitive psychological
concept of human working memory. Here we discuss
in detail the computations for one of these. The more
complicated model involves an integral

$$I(h) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{0}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(\lambda) L(\lambda) \pi(\lambda)\, d\lambda$$

with 
The prior $\pi(\lambda)$ is a product of normal densities, except
for the $\sigma_\epsilon$ variable which contributes a factor
g-a,  1^ .  The  likelihood  function  has  the  form 


with 


Here,  [x]  is  used  to  denote  the  integer  part  of  x  and  the 
^i,j,k,i  and  nij^k.i  values  come  from  experimental  data. 

For  prior  transformations,  we  use  appropriately  chosen 
inverse  normal  transformations  on  all  of  the  variables  ex¬ 
cept  for  erg.  For  this  variable  erg  =  d' /x,  with  d'  =  (§)», 
followed  by  an  inverse  normal  transformation  gives  (in 
one  variable) 


Using these transformations on the prior and an inverse
normal transformation on each of the likelihood inner
integrals, $I(h)$ now becomes


m  =  1'  Hz)^-^C-^f^Liz)dz, 

where 


9i0f\si,j),X{z))dsij, 


h{z)  =  /i(A(z))  and  -b  <t#<5  *(s,,j). 

The modal approximation transformation can be used
almost directly with the integral $I$. The only difficulty
occurs with the variable $\sigma_\epsilon$, which has limits 0 and $\infty$.
We use an initial transformation $\sigma'_\epsilon = \log(\sigma_\epsilon)$, and
then use optimization to obtain the mode $\hat{\lambda}$ and modal
covariance matrix $\Sigma$ with Cholesky factor $C$.

The integral $I(h)$ can then be put into the form

$$I(h) = (2\pi)^{m/2} |C| \int_0^1 \cdots \int_0^1 p(\lambda(z))\, dz,$$

with

$$p(\lambda(z)) = e^{w_4 + y^t y/2}\, h(\lambda(z)) L(\lambda(z)) \pi(\lambda(z)),$$

where $\lambda(z) = (w_1, w_2, w_3, e^{w_4}, w_5, w_6)^t$ is defined using
$w = \hat{\lambda} + C y(z)$ with $y(z) = (\Phi^{-1}(z_1), \ldots, \Phi^{-1}(z_m))^t$.


5  Computations 

The  integral  calculations  all  require  a  six-dimensional 
outer  integral  of  a  function  that  requires  a  product  of 
twenty  one-dimensional  inner  integrals.  The  inner  inte¬ 
grals  were  all  computed  with  a  simple  subregion  adaptive 
one-dimensional  quadrature  algorithm.  This  algorithm 
is  similar  to  the  algorithm  used  by  the  QUADPACK 
(Piessens,  deDoncker-Kapenga,  and  Kahaner,  1983)  sub¬ 
routine  QAG,  with  a  7-15  point  Gauss-Kronrod  pair  cho¬ 
sen  for  the  basic  integration  rule.  The  outer  integrals  were 
computed  with  a  subregion  adaptive  m-dimensional  algo¬ 
rithm  using  the  SCUHRE  (Berntsen,  Espelid  and  Genz, 
1991a,b)  subroutine  for  vectors  of  integrals. 

One  problem  with  the  computation  of  the  inner  inte¬ 
grals  was  how  to  set  the  required  level  of  accuracy.  Since 
the  computation  of  all  of  these  inner  integrals  is  what 
takes  most  of  the  time  in  the  calculations,  it  was  neces¬ 
sary  to  choose  this  parameter  with  some  care.  The  re¬ 
sults  in  Tables  1  and  2  used  a  relative  error  tolerance  for 
each  inner  integral  of  IQ—*.  A  significantly  smaller  value 
for  the  error  tolerance  significantly  increased  the  compu¬ 
tation  time  without  significantly  changing  the  results;  a 
much  larger  value  decreased  the  computation  time  but 
changed  the  results  significantly. 

A  second  problem  with  the  inner  integrals  involved  scal¬ 
ing.  Some  experimentation  showed  that  these  inner  inte¬ 
grals  had  values  that  were  typically  about  10“®.  Because 
a  product  of  twenty  of  these  could  cause  underflow,  we 
computed  log(L(X))  using  a  sum  of  the  logs  of  the  inner 
integral  factors.  This  sum  was  initialized  to  the  value 
210, and the likelihood value was obtained by exponentiating
the final sum. The effect of this initialization was to
scale the integral value by $e^{210}$, but since the numbers of
interest all require a division by $I(1)$, these scale factors
cancel.
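The scaling device is easy to reproduce. The sketch below accumulates the logarithms of the inner integral factors starting from an offset and exponentiates at the end; the function name and the sample values are assumptions chosen only to show that a direct product can underflow while the offset version does not, and that the offset cancels in any ratio.

```python
import math


def scaled_likelihood(inner_integrals, log_offset=210.0):
    """Return exp(log_offset) times the product of the inner integrals.

    The product is formed as a sum of logarithms so that a product of many
    small factors does not underflow before the offset is applied.
    """
    log_l = log_offset
    for value in inner_integrals:
        log_l += math.log(value)
    return math.exp(log_l)


def naive_product(values):
    """Direct product, shown only for comparison."""
    result = 1.0
    for value in values:
        result *= value
    return result


if __name__ == "__main__":
    factors = [1e-20] * 20                     # illustrative values only
    print(naive_product(factors))              # underflows to 0.0 in double precision
    print(scaled_likelihood(factors, 950.0))   # remains representable
```

Because every integral in the vector carries the same factor, the ratios that give the posterior expectations are unchanged by the choice of offset.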

In Tables 1 and 2 below, the results are given for
both types of transformations. The adaptive integration
software computed the vector of integrals $I(h)$ for
h  =  l,f/g\f/^\log('y),erg,a,0,  and  then  scaled  the  re¬ 
sults  by  7(1)  to  obtain  the  required  expected  values. 

Since the major part of the computation is the computation
of the likelihood products, the time was reduced to
approximately 1/7 by the simultaneous computation
of all the integrals (i.e., for all 7 choices of $h$). The
prior transformation results required approximately four
hours of single precision computation time on a DECstation
3100 (14 mips). The modal transformation results
required approximately twenty minutes. In both tables,
the numbers in the rows labelled "$L(\lambda)$'s" are the numbers
of evaluations of $L(\lambda)\pi(\lambda)$. Several columns of results
are given to illustrate the speed of convergence, and
allow some estimation of the accuracy in the results.


Table 1: Prior Transformation Results

L(λ)'s       5957     11753     23989     47817

            1.068     1.065     1.068     1.064
           -4.328    -4.322    -4.320    -4.314
           -4.961    -4.964    -4.946    -4.939
            0.178     0.183     0.184     0.181
            1.397     1.391     1.396     1.395
            0.796     0.789     0.794     0.793

Table  2:  Modal  Transformation  Results 


It is clear that the modal transformation results in the
first column of Table 2 are accurate to 2-3 digits, but the
prior transformation results do not have this level of
accuracy until the last column. In this example we can see
that the modal transformation method takes about 1/100
of the time taken by the prior transformation method to
achieve a comparable level of accuracy. The modal
transformation results in the first column of Table 2 were
actually obtained by the adaptive algorithm using only two
subregions. The basic integration rule used here has degree
seven, so that the transformed integrand apparently
has a good local degree seven polynomial approximation.
One problem that we had with the subregion adaptive
software was with the error estimates. The estimates
provided by the software for the relative errors in the final
column of the Table 2 results were approximately 0.1, while the
actual results apparently have much smaller errors. This
problem is not uncommon with this type of software, where
the error estimates are usually very conservative. The
usual solution to this problem is to take the approach
that we have taken, and that is to estimate accuracy by
looking at the level of agreement between results from
finer and finer subdivisions.

6  Concluding  Remarks 

The reported results demonstrate the potential of subregion
adaptive integration for solution of numerical integration
problems in Bayesian analysis. The results also
demonstrate the importance of choosing a good transformation
to precondition the problem before a numerical
integration method is used. The prior transformations that
were used to obtain the less accurate results are
transformations that might naturally be chosen by a numerical
analyst, without a deeper knowledge of the expected
approximate multivariate normal structure for the complete
integrand. On the other hand, subregion adaptive
integration methods are not widely used by statisticians.
The relatively small number of function evaluations we
found to be required when using the modal transformation
makes us optimistic that, together with this kind of
modification, subregion adaptive integration could prove
to be competitive with, or superior to, available alternatives
for solving similar numerical integration problems.
Abstract

The  normal-logistic  convolution  arises  in  several 
statistical  applications,  including  logistic  regression 
models  and  multinomial  logit  models.  We  begin  by 
characterizing  the  logistic  distribution  as  a  scale 
mixture  of  normals.  We  then  construct  least 
maximum  approximations  of  the  logistic  distribution 
function  using  finite  discrete  mixtures  of  normal  dfs 
using  the  Remes  algorithm.  The  convolution  integral 
follows  by  convolving  this  approximation  with  the 
normal. 

1.  Introduction 

Logistic regression has become one of the most
popular methods of analyzing experiments with
discrete or binary outcomes. The most common
situation models the survival probability of a patient
given a specified dosage, that is

Pr( patient survives | receives dose x )

= Pr( Y = 1 | X = x )

= $F(\beta^T x)$

where $F(t)$ is the logistic distribution function
$F(t) = (1 + e^{-t})^{-1} = e^{t} / (1 + e^{t})$ and $\beta$ is a vector of
unknown regression coefficients to be estimated. The
normal-logistic convolution $G(\eta, \tau)$, defined by

$G(\eta, \tau) = \int F(t)\, \tau^{-1} \phi\left( \frac{t - \eta}{\tau} \right) dt$

= Pr( Logistic RV < t, t ~ Normal($\eta$, $\tau^2$) )

arises in three related situations: measurement errors,
random effects, and forecasts.
In the measurement error model, we observe Z
and not X, and the distribution of X | Z = z is
expressed as Normal( $\mu(z)$, $\Omega(z)$ ). Then we have the
observed outcome probability structure

Pr( Y = 1 | Z = z ) = $G(\eta = \beta^T \mu(z),\ \tau = \sqrt{\beta^T \Omega(z) \beta})$.

In the random effects model, most commonly random
litter effects in animal experiments, we have the
random effect $\epsilon_i$ for litter i, and subjects j within the
litter. The survival probability for the subjects in
litter i (with covariates $x_i$) is then given by

Pr( $Y_{ij} = 1$ | $x_i$, $\epsilon_i$ ) = $F(\beta^T x_i + \epsilon_i)$

so that the expected number surviving in litter i is

E( $Y_i$ | $X_i = x_i$ ) = $n_i\, G(\beta^T x_i, \tau)$

where $\tau$ is the standard deviation of the random litter
effect. In the third situation, following a Bayesian
argument, one may from the logistic regression
analysis of an experiment have an approximate
posterior distribution for the regression coefficient
vector $\beta$ that is multivariate normal,

$\beta \approx$ Normal( $\hat{\beta}$, $J^{-1}$ ).

To construct the forecast probability of survival for a
patient with covariates X = x, one must compute the
convolution integral in the form

Pr( Y = 1 | X = x ) = $G(\hat{\beta}^T x,\ \sqrt{x^T J^{-1} x})$.

In all three applications, the convolution integral
$G(\eta, \tau)$ will be computed often, and fast, accurate
calculations are required.
2. Key Result and Implications

Stefanski (1990) proved that the logistic
distribution can be expressed as a scale mixture of
normals, that is $F(t) = \int \Phi(ts)\, q(s)\, ds$ where $\Phi(\cdot)$ is
the standard normal distribution function. The
mixing distribution $q(s)$ is, strangely enough, related
to the Kolmogorov-Smirnov distribution, but that is
not important here. The following calculations show
the implications of this result to the computation of
$G(\eta, \tau)$:

$G(\eta, \tau) = \int F(t)\, \tau^{-1} \phi( (t - \eta)/\tau )\, dt$

$= \int \left\{ \int \Phi(ts)\, \tau^{-1} \phi( (t - \eta)/\tau )\, dt \right\} dQ(s)$

where the expression in braces can be expressed in
probabilistic terms for simplification

{ } = Pr( Z < ts where Z ~ N(0, 1)
and t ~ N($\eta$, $\tau^2$) )

= Pr( Z - ts < 0 )

= $\Phi( \eta s / \sqrt{1 + s^2 \tau^2} )$.

The convolution integral G can then be evaluated as

$G(\eta, \tau) = \int \Phi( \eta s / \sqrt{1 + s^2 \tau^2} )\, dQ(s)$.

For computation, the integration with respect to the
mixing distribution dQ(s) is approximated by a k
point discrete distribution $Q_k(s)$ with masses $p_j$ at $s_j$
for j = 1, ..., k. The approximations $G_k(\eta, \tau)$ now
only require evaluations of ERF/ERFC:

$G_k(\eta, \tau) = \sum_{j=1}^{k} p_j\, \Phi( \eta s_j / \sqrt{1 + s_j^2 \tau^2} )$.
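The discrete-mixture form $G_k(\eta, \tau) = \sum_j p_j \Phi(\eta s_j / \sqrt{1 + s_j^2 \tau^2})$ can be sketched directly with the complementary error function. The paper's fitted masses and scales are not reproduced in this text, so the example below assumes a single component with the widely used scale $s_1 = 1/1.702$, and checks it against a brute-force trapezoid evaluation of the convolution integral. The function names and the reference integrator are illustrative.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function via erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def g_mixture(eta, tau, masses, scales):
    """G_k(eta, tau) for a discrete mixing distribution {(p_j, s_j)}."""
    return sum(p * std_normal_cdf(eta * s / math.sqrt(1.0 + (s * tau) ** 2))
               for p, s in zip(masses, scales))


def g_reference(eta, tau, n=4000, width=10.0):
    """Trapezoid evaluation of the convolution integral, for comparison only."""
    a, b = eta - width * tau, eta + width * tau
    h = (b - a) / n
    total = 0.0
    for i in range(n + 1):
        t = a + i * h
        weight = 0.5 if i in (0, n) else 1.0
        logistic = 1.0 / (1.0 + math.exp(-t))
        density = math.exp(-0.5 * ((t - eta) / tau) ** 2) / (tau * math.sqrt(2.0 * math.pi))
        total += weight * logistic * density
    return total * h


if __name__ == "__main__":
    # One-component mixture: an assumed stand-in for the fitted {p_j, s_j}.
    masses, scales = [1.0], [1.0 / 1.702]
    print(g_mixture(1.0, 2.0, masses, scales), g_reference(1.0, 2.0))
```

Each evaluation of the mixture costs k calls to `erfc`, compared with thousands of integrand evaluations for the reference rule.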


3.  Least  Maximum  Approximations 

The expression for $G_k(\eta, \tau)$ shows only the
premise of a computational method for computing the
convolution integral G, since the discrete
approximation $Q_k$ in terms of $\{ p_j, s_j, j = 1, \ldots, k \}$
must be determined. While the ultimate goal is to
make the error $|G - G_k|$ small, achieving this over
both parameters $\eta$ and $\tau$ is quite difficult. Making
the error for the mixing distribution $|Q - Q_k|$ small
can provide a bound for the error $|G - G_k|$, but
choosing the best here cannot guarantee small error in
$G_k$. Our approach has been to find the best $\{ p_j, s_j,
j = 1, \ldots, k \}$ to minimize the maximum over t of
$|F(t) - F_k(t)|$, where $F_k(t) = \sum_{j=1}^{k} p_j \Phi(t s_j)$ is the
approximation to the logistic df. An accurate least
maximum or "Chebyshev" approximation to F will
then lead to an accurate approximation to G.

Least maximum approximations are the staple
of function approximations for computer evaluations
(Hart, et al., 1968). However, usually the
approximant is a polynomial or a rational function.
Taking the polynomial case for illustration, let
$H_p(x) = \sum_{j=0}^{p} a_j x^j$ be approximating the function $H(x)$.
Then, according to Chebyshev's Theorem, the least
maximum approximation will leave the difference
function $d(x) = H(x) - H_p(x)$ with extrema that
alternate in sign and achieve equal magnitudes. With
a polynomial $H_p(x)$ of degree p, with p+1 parameters,
there will be p+2 extrema for the least maximum
approximation. The algorithm commonly used for
finding such an approximation, the Remes Algorithm,
follows these steps:


Remes Algorithm

a) Find p+1 roots $\{ z_j \}$ of the difference function
$d(x)$

b) Find p+2 extrema $\{ x_i \}$

c) Solve the linear equations in $\{ a_j, j = 0, \ldots, p \}$, $D$:
$d(x_i) = H(x_i) - H_p(x_i) = (-1)^{i+1} D$
for i = 1, ..., p+2

d) Repeat
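For the polynomial case, these steps can be sketched as follows. This is a simplified version under stated assumptions: the roots of step (a) are replaced by sign changes of the error on a fixed grid, the extrema of step (b) are taken as the grid points of largest absolute error between consecutive sign changes, and the linear system of step (c) is solved by a small Gaussian elimination routine. The names `remes` and `solve_linear` are illustrative.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def remes(func, degree, lo, hi, iterations=10, grid=2000):
    """Return (coefficients a_0..a_p, levelled error |D|) for func on [lo, hi]."""
    n = degree + 2
    # Initial reference: Chebyshev extrema mapped to [lo, hi].
    xs = [0.5 * (lo + hi) - 0.5 * (hi - lo) * math.cos(math.pi * i / (n - 1))
          for i in range(n)]
    pts = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
    for _ in range(iterations):
        # Step (c): sum_j a_j x_i^j + (-1)^i D = H(x_i), with i counted from 0.
        rows = [[x ** j for j in range(degree + 1)] + [(-1.0) ** i]
                for i, x in enumerate(xs)]
        solution = solve_linear(rows, [func(x) for x in xs])
        coeffs, d = solution[:-1], solution[-1]
        vals = [func(x) - sum(c * x ** j for j, c in enumerate(coeffs)) for x in pts]
        # Steps (a)-(b): split the grid at sign changes, one extremum per piece.
        breaks = [i for i in range(1, grid + 1) if vals[i] * vals[i - 1] < 0.0]
        bounds = [0] + breaks + [grid + 1]
        new_xs = [pts[max(range(b0, b1), key=lambda k: abs(vals[k]))]
                  for b0, b1 in zip(bounds, bounds[1:])]
        if len(new_xs) != n:
            break          # unexpected number of extrema; keep the last solution
        xs = new_xs        # step (d): repeat with the new reference
    return coeffs, abs(d)


if __name__ == "__main__":
    coeffs, d = remes(math.exp, 2, -1.0, 1.0)
    print(coeffs, d)
```

At convergence the levelled error D and the largest error on the interval agree, which is the equal-magnitude alternation described above.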
In our situation, the approximant $F_k(t)$ is not a
polynomial, but a complicated function
$F_k(t) = \sum_{j=1}^{k} p_j \Phi(t s_j)$ with 2k-1 parameters ($\sum p_j = 1$). If an
analogous approach can be successful here, we would
hope to find 2k-1 roots of the difference function
$D_k(t) = F(t) - F_k(t)$ and 2k extrema alternating in
sign and equal in magnitude. Some special
considerations include symmetry about 0, since
$D_k(-t) = -D_k(t)$, and $D_k(0) = 0$ because $F(0) = 1/2 =
\sum p_j \Phi(0 \cdot s_j)$. Moreover, since $F(t) \to 1$ more slowly than
$\Phi(st)$ for large t, $D_k(t)$ will be negative for large
t. Also, $D_k(t)$ will have a positive slope at the origin.
These considerations lead to an even number of
extrema, which luckily matches the odd number of
parameters 2k-1. Our algorithm for finding the least
maximum approximation differs from the Remes
algorithm only in that we solve a system of nonlinear
equations in $\{ p_j, s_j \}$ and $D$ of the form
$F(t_i) - F_k(t_i) = (-1)^{i+1} D$.


In spite of these obstacles we have been able to
find approximants $F_k(t)$ whose difference function has
extrema that alternate in sign and have equal
magnitudes, and we suppose, but have not proven,
that the approximants are the least maximum
approximations. For small values of k, we could
obtain starting values by trial and error. But for
larger values of k, we found starting values by
successive nonlinear regression, minimizing over $\{ p_j,
s_j \}$ the weighted sum of squares

$\sum_i w_i [ F(x_i) - F_k(x_i) ]^2$

By taking $w_i = [ F(x_i) - F_k(x_i) ]^{2r}$, we are able to
minimize the 2r+2 norm, and by increasing r,
approach the Chebyshev solution. We were able to
achieve high accuracy for these approximations.


Table 1 gives the values of $D_k^* = \sup |D_k(t)|$ for k =
1, ..., 8.
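The k = 1 case is simple enough to check numerically. With a single component the only free parameter is the scale $s_1$ (the mass is 1), so the least maximum fit reduces to a one-dimensional search. The sketch below assumes a grid in t to approximate the supremum and a ternary search over the scale; both are illustrative choices, not the authors' procedure. The result should land close to the k = 1 entry of Table 1, with $1/s_1$ near the familiar 1.702 scaling.

```python
import math


def logistic_cdf(t):
    """Logistic distribution function F(t)."""
    return 1.0 / (1.0 + math.exp(-t))


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def max_error(s, t_max=15.0, n=3000):
    """Grid approximation to sup |F(t) - Phi(t s)|; t >= 0 suffices by symmetry."""
    return max(abs(logistic_cdf(t_max * i / n) - normal_cdf(s * t_max * i / n))
               for i in range(n + 1))


def best_single_scale(lo=0.3, hi=1.0, iterations=60):
    """Ternary search for the scale minimizing the maximum error."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if max_error(m1) < max_error(m2):
            hi = m2
        else:
            lo = m1
    s = 0.5 * (lo + hi)
    return s, max_error(s)


if __name__ == "__main__":
    s, d1 = best_single_scale()
    print(s, 1.0 / s, d1)
```

The search relies on the maximum error being quasi-convex in the scale, which holds here because each pointwise error first decreases and then increases as the scale grows.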

While the original problem was the
approximation of $G(\eta, \tau)$ by $G_k(\eta, \tau)$, an accurate
solution to the approximation of $F(t)$ by $F_k(t)$ was
hoped to lead to an accurate $G_k(\eta, \tau)$. We have
found that since our approximations $G_k(\eta, \tau)$ improve
with increasing $\tau$, the error $D_k^*$ also bounds the error
$|G - G_k|$ for all $\eta$, $\tau$. Other approximations, based on
Taylor-like expansions, quickly fail when $\tau$ increases.
While the Crouch-Spiegelman method can achieve any
level of accuracy, their approach requires many more
evaluations and its use is not appropriate for the
applications mentioned here.
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Table  1 

Accuracy of Approximations

k    D_k        k    D_k
1    9.5(-3)    5    6.0(-7)
2    5.1(-4)    6    8.4(-8)
3    4.4(-5)    7    1.3(-8)
4    4.7(-6)    8    2.1(-9)
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Abstract 

Many  numerical  integration  algorithms  can  handle 
singular  integrals  effectively  if  they  are  told  the  locations 
of  the  singularities.  This  paper  gives  algorithms  for  (1) 
prescreening  the  integrand  for  the  locations  of 
singularities,  and  (2)  circumventing  compiler-imposed 
accuracy  restrictions.  In  application,  the  algorithms  are 
used  together  with  a  core  Romberg  type  algorithm. 
While  transformations  may  be  needed  to  handle 
"pathological"  integrals  that  converge  very  slowly,  most 
integrals,  including  nearly  all  probability  integrals,  can  be 
evaluated  quickly  and  accurately. 


Introduction 

Semi-automatic numerical algorithms are available
that allow the user to perform integration by simply
specifying the form of the integrand, the limits of
integration and the accuracy requirement. Some algorithms
also allow evaluation of singular integrals by specifying,
in addition, the locations of singular points of the
integrand. This paper presents an algorithm for detecting
the location of singularities. Once the locations are
known, a semi-automatic procedure can evaluate the
integrals accordingly.

A true automatic integrator, like a robot-driven
automobile in the streets of New York City, has not yet
been perfected. Semi-automatic integrators, which
require a minimal amount of human steering, are
available. A number of automatic integrators were
presented in Davis and Rabinowitz [3]. The three
qualities required of an automatic integrator, as defined
by Davis and Rabinowitz, are efficiency, reliability, and
robustness. The automatic handling, of course, defines
the fourth quality, which a non-automatic integrator does
not possess: convenience. As micro-computers become
increasingly powerful and computer time less costly, the
importance of efficiency is fast diminishing, other than
in real-time applications. Since reliability often
depends on efficiency and computing power, its
prominence is also declining. A little muscle-flexing of
computer power would yield more accurate results. Part
of the reliability requirement also goes together with
robustness, since the integrator is required to handle a
broad range of integrals and also be predictable. A key
requirement for an automatic integrator is the ability to
handle singularities. Otherwise, an integrator can be
automatic in all other applications but will break down
when encountering a singular integral.

The  Core  Integrator 

The core integrator is an extended Romberg (RE)
algorithm given by Wang [6]. The merits of classic
Romberg are fully explored in Bauer, Rutishauser and
Stiefel [1]. The method is valid for all Riemann integrals
(it is therefore robust). It converges rapidly (and is
therefore efficient). Finally, it is predictably accurate
(i.e., reliable). The performance of the classic Romberg
method, however, depends on the asymptotic behavior of
the integrand. If the integrand converges slowly
somewhere on the interval of integration, such as in the
neighborhood of a singular point, the accuracy
may be unsatisfactory. The RE method treats the range
of integration "dynamically," adjusting to the asymptotic
behavior of the integrand. Essentially, we have an
integral

$$I = \int_a^b f(x)\,dx$$

such that

$$I = \sum_{j=0}^{n-1} I_j$$

where $a_0 = a$, $a_n = b$, and

$$I_j = \int_{a_j}^{a_{j+1}} f(x)\,dx .$$
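The additive decomposition can be sketched with a classic Romberg rule applied piece by piece. This is a minimal illustration of the classic method only, not the extended (RE) algorithm of [6], which chooses its subintervals dynamically; the names `romberg` and `integrate_by_pieces`, the tolerance, and the fixed breakpoints are assumptions of this example.

```python
import math


def romberg(f, a, b, max_levels=12, tol=1e-12):
    """Classic Romberg rule: trapezoid sums refined by Richardson extrapolation."""
    h = b - a
    row = [0.5 * h * (f(a) + f(b))]          # one-panel trapezoid estimate
    panels = 1
    for level in range(1, max_levels + 1):
        h /= 2.0
        # Only the new midpoints need to be evaluated at each halving.
        mid_sum = sum(f(a + (2 * i + 1) * h) for i in range(panels))
        new_row = [0.5 * row[0] + h * mid_sum]
        factor = 4.0
        for k in range(level):
            new_row.append(new_row[k] + (new_row[k] - row[k]) / (factor - 1.0))
            factor *= 4.0
        if abs(new_row[-1] - row[-1]) <= tol * max(1.0, abs(new_row[-1])):
            return new_row[-1]
        row = new_row
        panels *= 2
    return row[-1]


def integrate_by_pieces(f, breakpoints):
    """Sum the integrals I_j over consecutive subintervals (a_j, a_{j+1})."""
    return sum(romberg(f, lo, hi)
               for lo, hi in zip(breakpoints, breakpoints[1:]))


total = integrate_by_pieces(math.sin, [0.0, 1.0, 2.0, math.pi])
```

For a smooth integrand such as sin on (0, pi) the pieces sum to the same value as a single call; the decomposition matters when one of the pieces sits next to a singular point.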


The additive decomposition is valid provided that the
conditions for the Romberg method (the integral is
Riemann and bounded) are satisfied in each interval
$(a_j, a_{j+1})$. Mathematically, if the integral is Riemann in
(a, b), it should be Riemann in $(a_j, a_{j+1})$ for all j. In
practice, the classic Romberg method operating in (a, b)
may not get close enough to the singularity points to
cause problems, even though it would not yield very
accurate results. But in a narrow neighborhood, such as
in one of the sub-intervals $(a_j, a_{j+1})$, a singular point
would be more likely to cause computational difficulties.

Automatic Integrator 449

The RE algorithm has been tested successfully for
evaluating integrals with end-point singularities (as well
as improper integrals). Evans, Hyslop and Morgan [4]
gave many test integrals including 12 from Chisholm,
Genz and Rowlands [2], 9 from Harris and Evans [5], and
10 of their own. The various Gaussian-type methods used
and discussed in Evans et al. achieved accuracy
from 3 to 10 digits. In particular, the ε-Patterson
procedure used by Evans et al. consistently reached
10-digit accuracy. The RE procedure had little trouble
getting 10-digit accuracy for all but two oscillatory
(trigonometric) integrals. More precisely, RE obtained
7-digit accuracy for the oscillatory integrals, and 15-digit
accuracy for all other integrals. With the exception of a
pathologically slowly converging integral $I_1$, taken over
(0, 1), no functional transformation of the integrand was
required. For $I_1$, after an exponential transformation, RE
obtained the exact value for the integral.

Detecting Singularity

In order to move closer to an automatic setting, we
present a simple algorithm for detecting singularity of the
integrand. The algorithm is based on a search process.
The idea is to find a neighborhood of a singular point
where the absolute value of the function is large, and
either the function or its derivative changes sign.

1. Let h = (b-a)/n for suitable n.

2. Select a* < a-2h, and b* > b+2h, such that neither a-a*
nor b-b* is a multiple of h.

3. Initialize.

Compute y_0 = f(a* + h), Dy_0 = (f(a* + h) - f(a*))/h.

4. Searching.

In the interval (a*, b*), for j = 1, ..., n,
compute y_j = f(a* + (j+1)h), and
Dy_j = (f(a* + (j+1)h) - f(a* + jh))/h.

5. Zero in.

If y_j y_{j-1} < 0 or Dy_j Dy_{j-1} < 0,
readjust the intervals:
a = a* + jh, b = a* + (j+1)h, and go back to step 1.

6. Repeat steps 1-5 until reaching some point (y = f(x))
where |y| is sufficiently large to resemble infinity.
Declare a singularity point.

The Under- and Over-flowing Problem

In order for the algorithm to work, we must work
within and around the limit of the computation
environment. The arithmetic processor sets a boundary
around what it recognizes as a valid real number. For
example, a given Fortran compiler may recognize a
(double precision) real number as one in the interval
$(-2 \times 10^{308}, 2 \times 10^{308})$. A zero, therefore, can retain an
accuracy close to $10^{-308}$. In order to identify a singularity
before we are blown out by an overflowing or
underflowing problem, we need to compute using those
very "large" or "small" numbers. A relatively easy way to
circumvent the boundary problems is to retool all
arithmetic operations and intrinsic functions, including
multiplication, division, exponentiation, logarithmic and
trigonometric functions. For example, the division x/y
can be rewritten as x div y, which does everything x/y
would do except when y is "zero" (say, less than $10^{-308}$),
at which point it will stop and give a message, without
having to discontinue the program. The ordinary
division x/y would in this case have killed the program.
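The guarded division and the grid search can be sketched together as follows. This is a loose reading of steps 1-6 rather than the paper's exact procedure: the names `guarded_div` and `find_singularity`, the cutoffs `TINY` and `big`, the 2.5h padding, and the rule of zooming in on the sign change with the largest |f| are all assumptions of this example.

```python
import math

TINY = 1e-300  # assumed cutoff below which a divisor counts as "zero"


def guarded_div(x, y, tiny=TINY):
    """Divide x by y; flag a "zero" divisor with a signed infinity instead of failing."""
    if abs(y) < tiny:
        return math.copysign(math.inf, x)
    return x / y


def find_singularity(f, a, b, n=10, big=1e12, max_rounds=60):
    """Zoom in on a point where |f| blows up while f or its slope changes sign.

    Returns the grid point with the largest |f| once |f| >= big, else None.
    """
    for _ in range(max_rounds):
        h = (b - a) / n
        if h == 0.0:
            return None
        a_star = a - 2.5 * h          # pad by a non-integer multiple of h
        m = n + 5                     # grid runs from a* to b* = b + 2.5 h
        xs = [a_star + j * h for j in range(m + 1)]
        ys = [f(x) for x in xs]
        dys = [None] + [(ys[j] - ys[j - 1]) / h for j in range(1, m + 1)]

        best = None  # (peak |f|, x at the peak, new a, new b)
        for j in range(1, m + 1):
            lo = None
            if ys[j] * ys[j - 1] < 0:                   # function changes sign
                lo = j - 1
            if j >= 2 and dys[j] * dys[j - 1] < 0:      # slope changes sign
                lo = j - 2
            if lo is None:
                continue
            peak = max(range(lo, j + 1), key=lambda i: abs(ys[i]))
            if best is None or abs(ys[peak]) > best[0]:
                best = (abs(ys[peak]), xs[peak], xs[lo], xs[j])
        if best is None:
            return None
        size, x_peak, a, b = best     # zero in on the bracketing interval
        if size >= big:               # |f| is large enough to "resemble infinity"
            return x_peak
    return None


# Hypothetical integrand with a pole at x = 0.3.
location = find_singularity(lambda x: guarded_div(1.0, x - 0.3), 0.0, 1.0)
```

Because every evaluation goes through `guarded_div`, landing exactly on the singular point yields a flagged infinity rather than a crash, which is the behavior the text asks of "x div y". The sketch evaluates f slightly outside (a, b), as step 2 requires, so the integrand must be defined there.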


The  Significant-Digit  Barrier 

The second limitation the arithmetic processor sets is a
limit on accuracy. For example, a given Fortran
compiler may allow double precision arithmetic to
carry 15 to 16 significant digits before truncation or
rounding takes place. Although the apparent range of
valid numbers is $(-2 \times 10^{308}, 2 \times 10^{308})$, it is full of
enormous holes. All numbers that must be represented
by 17 or more digits would fall into those holes.
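The size of these holes can be checked directly. The short sketch below assumes IEEE double precision, which is what Python floats use on common platforms; the variable names are illustrative only.

```python
import sys

# Nearest representable double below 1.0 (1 - 2**-53).
largest_below_one = 1.0 - sys.float_info.epsilon / 2

# Width of the gap between that value and 1.0.
gap_below_one = 1.0 - largest_below_one

# Smallest positive normalized double, roughly 2.2e-308.
smallest_normal = sys.float_info.min

print(repr(largest_below_one), gap_below_one, smallest_normal)
```

A point near 0 can therefore be approached to within about 1e-308, while a point near 1 can only be approached to within about 1.1e-16.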
Consider the integral

$$I = \int_0^1 f(x)\,dx$$

with singularities at both end points. The lower limit, 0,
as approximated by $10^{-308}$, is accurate to the 308th
digit. But the upper limit, 1, if inexact, cannot be
represented from below by anything better than
0.9999999999999999. As a consequence, a chunk of I,
namely the integral

$$J = \int_{0.9999999999999999}^{1} f(x)\,dx$$

cannot be captured. It becomes a part of the error.

450 C.C. Wang

For example, the value of $\pi$ can be represented by the
integral

$$I_2 = \int_0^1 t^{-1/2} (1-t)^{-1/2}\,dt$$

treated as an integral with singularities at both ends, or
by the integrals

$$I_3 = 2 \int_0^{1/2} t^{-1/2} (1-t)^{-1/2}\,dt$$

and

$$I_4 = 2 \int_{1/2}^{1} t^{-1/2} (1-t)^{-1/2}\,dt$$

each treated as an integral with a singularity at one of the
two ends. For $I_3$, the integrand is singular at 0; the RE
procedure returned a value of 3.141592653589793,
accurate to 16 digits. For $I_2$ or $I_4$, the integrand is
singular at 1. In both cases, the result is 3.141589...,
accurate to five digits.

Practically, there is little we can do to resolve this
problem. For certain integrals, but not in general, a
transformation of the integrand can shift a singularity
away from a non-zero point to zero. Then things
become more manageable.

References

1. Bauer, F. L., Rutishauser, H. and Stiefel, E. (1963).
New Aspects in Numerical Integration, Proceedings of
Symposia in Applied Mathematics, Vol. XV, Am. Math.
Soc.

2. Chisholm, J. S. R., Genz, A. and Rowlands, G. E.
(1973). Accelerated Convergence of Sequences of
Quadrature Approximations, J. Comp. Physics, 10, 284-

3. Davis, P. J. and Rabinowitz, P. (1984). Numerical
Integration, Blaisdell, Waltham, Massachusetts.

4. Evans, G. A., Hyslop, J., and Morgan, A. P. G. (1983).
An Extrapolation Procedure for the Evaluation of
Singular Integrals, Intern. J. Computer Maths. 12, 251-265.

Note 

The  views  expressed  in  this  paper  do  not  necessarily 
reflect  those  of  the  U.S.  Department  of  Justice. 
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Abstract 

The computation of the multinomial
distribution function is of interest to many
researchers and practitioners working in
engineering and in the related
disciplines of computing sciences. The
accurate computation of probabilities is very
important in some applied areas. A direct
computation is not only difficult, due to the
limitations of computer systems, but also
inaccurate, as a result of many redundant
computations. This research develops an
effective method to compute the weight
probabilities for the multinomial distribution
function accurately and efficiently.

1.  Introduction 

Accurate  probabilities  of  the  multinomial 
distribution  function  are  very  important  in 
many  theoretical  and  applied  areas  related  to 
engineering  and  computing  sciences. 
However,  the  direct  computation  is  still 
considered  difficult  due  to  the  computation  of 
factorials  and  decimal  numbers;  the 
limitations  of  computer  systems  such  as 
overflow,  underflow,  and  the  maximum 
accuracy;  the  programming  techniques  such  as 
writing  a  reliable  and  efficient  program;  and 
the  time  consumed  by  computing. 
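The overflow and underflow limits mentioned above are easy to reproduce. The sketch below assumes IEEE double precision floats; the helper name `factorial_as_float` is hypothetical and is not part of the paper's Ada program.

```python
import math


def factorial_as_float(n):
    """Return n! as a float, or None if it does not fit in double precision."""
    try:
        return float(math.factorial(n))
    except OverflowError:
        return None


fits = factorial_as_float(170)      # about 7.26e306, still representable
too_big = factorial_as_float(171)   # exceeds the double-precision range
vanished = 0.01 ** 200              # true value 1e-400 underflows to 0.0
```

A direct evaluation of N! times a product of small powers can therefore fail at both ends of the floating-point range even when the final probability is an ordinary number.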

As a result, no current software package,
such as the IMSL Library [5], computes the
multinomial distribution function, and no
effective algorithm is available for the
computation. This paper presents an
effective method to compute the probabilities
for the multinomial distribution function
accurately and efficiently. Section 2 discusses
some important applications of the
multinomial distribution function. Section 3
presents the new method and its
mathematical foundations. Finally, some
computational examples are given in Section 4
and conclusions in Section 5.


2.  Applications  of  Multinomial 
Distribution  Function 

Let $E_i$ (i = 1, 2, ..., k) be k mutually
independent events, and let the probability of
occurrence of the event $E_i$ be equal to $q_i$. Then
the joint distribution of the random variables
$n_i$ (i = 1, 2, ..., k), representing the numbers
of occurrences of the events $E_i$ (i = 1, 2, ..., k)
respectively in N trials (with $n_1 + n_2 + \cdots + n_k = N$),
is called the multinomial distribution
function and is defined by

$$P(n_1, n_2, \ldots, n_k) = N! \prod_{i=1}^{k} \frac{q_i^{n_i}}{n_i!} .$$

The  multinomial  distribution  function 
is  employed  in  many  diverse  fields  of 
statistical  analysis.  In  general,  it  is  used  in 
the  same  circumstances  as  those  in  which  a 
binomial  distribution  might  be  used,  when 
there  are  multiple  categories  of  events  instead 
of  a  simple  dichotomy.  For  example: 

In computer science, a program
requires I/O (input or output) services from
device i with probability $q_i$ at the end of a
CPU (central processing unit) burst, with
$q_1 + q_2 + \cdots + q_k = 1$. This situation gives rise to a
multinomial distribution problem [6].

If one observes n CPU burst
terminations, then the probability that $n_i$ of
these will be directed to I/O device i (for i =
1, 2, ..., k) is given by a multinomial
probability mass function. One may also
replace “I/O device” by the word “file” for
applications in database management
systems. Another application of the multinomial
distribution function occurs in an operating
system when we consider a paging system and
model a program using the independent
reference model [2]. In this model, we assume
that successive page references are
independent and the probability of referencing
page i is $q_i$.

Another important field of application
is in the kinetic theory of classical physics [4].
Each particle is considered to occupy a cell in a
six-dimensional space, three dimensions for position
and the other three for velocity. Each allocation of N
particles among the k cells available
constitutes a microstate. The thermodynamic
probability of a macrostate is proportional to
the multinomial distribution function.


3.  The  Method  and  Its  Mathematical 
Foundations 

This research applies prime number
factorization to factorials and rewrites the
probabilities $q_i$ (i = 1, 2, ..., k) in the simplest
fraction form, with the denominator as a product
of prime numbers. Then the cancellation of
numerator and denominator is applied to
reduce the computational complexity to a
minimum and to achieve maximum accuracy.
Also, the Ada programming language’s special
features [1, 7], “exception handling” and
“tasking,” are used to handle the difficulties of
computation such as overflow and underflow
problems, redundant multiplications and
divisions of the same numbers, and
time-consuming computation. These special Ada
features make the computation effective,
efficient, and accurate. The mathematical
foundation for this method is given as follows.

In order to reduce the computational
complexity of $P(n_1, n_2, \ldots, n_k)$, we need
theorems from the theory of numbers [3],
which are stated and proved below.

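Theorem 1. Let p be a prime. Then the
exact exponent of p that divides n! is

$$\sum_{j \ge 1} \left[ \frac{n}{p^j} \right] = \left[ \frac{n}{p} \right] + \left[ \frac{n}{p^2} \right] + \left[ \frac{n}{p^3} \right] + \cdots$$

where [x] is the largest integer less than or equal to x.

Proof: Write

$$n! = 1 \cdot 2 \cdot 3 \cdots (p-1) \cdot p \cdot (p+1) \cdot (p+2) \cdots 2p \cdots (p-1)p \cdots p^2 \cdot (p^2+1) \cdot (p^2+2) \cdots p^3 \cdot (p^3+1) \cdot (p^3+2) \cdots (n-1) \cdot n .$$

We see that the number of factors divisible by p
is [n/p], the number of factors divisible by $p^2$ is $[n/p^2]$,
the number of factors divisible by $p^3$ is $[n/p^3]$, and so
forth. Then the Theorem follows.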
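Theorem 1 translates into a few lines of integer arithmetic. The sketch below is an illustration in Python rather than the paper's Ada code, and the function name `legendre_exponent` is an assumption of this example.

```python
def legendre_exponent(n, p):
    """Exponent of the prime p in n!: the sum of floor(n / p**j) for j >= 1."""
    exponent = 0
    power = p
    while power <= n:
        exponent += n // power   # number of multiples of p**j up to n
        power *= p
    return exponent


exponent_of_2_in_20_factorial = legendre_exponent(20, 2)
```

The loop stops as soon as p**j exceeds n, since every later term of the sum is zero.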

From Theorem 1, we are able to factor
n!, for all n > 1, as a product of prime
numbers. The result is given in Theorem 2
below:

Theorem 2. For any positive integer $n \ge 2$,
n! can be written as a product of prime
numbers,

$$n! = p_1^{e_1} \cdot p_2^{e_2} \cdot p_3^{e_3} \cdots p_k^{e_k}$$

for some positive integer k.

Example: Consider 20!. We have the
following exponents of prime numbers:

The exponent of 2 is
$[20/2] + [20/2^2] + [20/2^3] + [20/2^4] = 10 + 5 + 2 + 1 = 18$.

The exponent of 3 is $[20/3] + [20/3^2] = 6 + 2 = 8$.

The exponent of 5 is $[20/5] = 4$.

The exponent of 7 is $[20/7] = 2$.

The exponents of 11, 13, 17, and 19 are all
equal to 1.

Hence, $20! = 2^{18} \cdot 3^8 \cdot 5^4 \cdot 7^2 \cdot 11 \cdot 13 \cdot 17 \cdot 19$.
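The cancellation idea can be sketched by subtracting prime exponents instead of dividing factorials. This is an illustration under stated assumptions, not the paper's Ada implementation: the function names are hypothetical, and Python's `Fraction` type stands in for the paper's hand-reduced fractions for the $q_i$.

```python
from fractions import Fraction


def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = [True] * (n + 1)
    primes = []
    for candidate in range(2, n + 1):
        if sieve[candidate]:
            primes.append(candidate)
            for multiple in range(candidate * candidate, n + 1, candidate):
                sieve[multiple] = False
    return primes


def factorial_exponents(n):
    """Prime factorization of n! as {prime: exponent}, using Theorem 1."""
    exponents = {}
    for p in primes_up_to(n):
        power, total = p, 0
        while power <= n:
            total += n // power
            power *= p
        exponents[p] = total
    return exponents


def multinomial_coefficient(counts):
    """N! / (n_1! ... n_k!) computed by cancelling prime exponents."""
    exponents = factorial_exponents(sum(counts))
    for count in counts:
        for p, e in factorial_exponents(count).items():
            exponents[p] -= e          # cancel the denominator's primes
    coefficient = 1
    for p, e in exponents.items():
        coefficient *= p ** e
    return coefficient


def multinomial_probability(counts, probs):
    """Exact P(n_1, ..., n_k) for rational probabilities q_i."""
    result = Fraction(multinomial_coefficient(counts))
    for count, q in zip(counts, probs):
        result *= Fraction(q) ** count
    return result


example = multinomial_probability(
    (2, 1, 1), (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)))
```

No factorial is ever formed in full, so the intermediate values stay far smaller than N!, and the result is exact until it is finally converted to a decimal.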


4.  Some  Computational  Results 

The following sample results were obtained
from the output of an Ada program
running on a MicroVAX II
machine. The program can compute the
probabilities of the multinomial distribution for up to
200 categories of events. Some results and
their computation times are given as follows:

(1)  n_1 = 10,   q_1 = 0.200
     n_2 = 15,   q_2 = 0.300
     n_3 = 5,    q_3 = 0.100
     n_4 = 12,   q_4 = 0.200
     n_5 = 8,    q_5 = 0.200

The probability,

P(10, 15, 5, 12, 8)

= 4.260814931023565440708158220943350 E-04

The time used for the computing is

1.899414062500000 E-01

(2)  n_1 = 4,    q_1 = 0.050
     n_2 = 3,    q_2 = 0.050
     n_3 = 2,    q_3 = 0.200
     n_4 = 3,    q_4 = 0.300
     n_5 = 5,    q_5 = 0.200
     n_6 = 4,    q_6 = 0.050
     n_7 = 3,    q_7 = 0.050
     n_8 = 2,    q_8 = 0.200
     n_9 = 3,    q_9 = 0.300
     n_10 = 5,   q_10 = 0.200

The probability,

P(4, 3, 2, 3, 5, 4, 3, 2, 3, 5)

= 2.991453222293589049380971718357348 E-11

The time used for the computing is

1.299438476562500 E-01

(3)  n_1 = 4,    q_1 = 0.010
     n_2 = 7,    q_2 = 0.090
     n_3 = 10,   q_3 = 0.020
     n_4 = 3,    q_4 = 0.010
     n_5 = 12,   q_5 = 0.050
     n_6 = 13,   q_6 = 0.120
     n_7 = 5,    q_7 = 0.130
     n_8 = 7,    q_8 = 0.020
     n_9 = 21,   q_9 = 0.050
     n_10 = 3,   q_10 = 0.250
     n_11 = 9,   q_11 = 0.010
     n_12 = 11,  q_12 = 0.020
     n_13 = 6,   q_13 = 0.010
     n_14 = 3,   q_14 = 0.030
     n_15 = 7,   q_15 = 0.180

The probability,

P(4, 7, 10, 3, 12, 13, 5, 7, 21, 3, 9, 11, 6, 3, 7)

= 2.702573447966758657909159323569038 E-47

The time used for the computing is

3.699951171875000 E-01


5.  Conclusions 

The computation of the multinomial
distribution function is in general a critical
problem due to the limitations of computer
systems and programming techniques. The
goal of a computation is accuracy; the time
consumed for the computation must also
remain reasonably small. This research has
developed a method based on theorems from
the theory of numbers and implemented it
in the Ada programming language; the former
breaks the limitations of a computer
system and the latter solves the technical
difficulty in the programming. Since the
computation of the multinomial distribution
function is by nature a number theory problem,
it can only be solved in
number theory. The results given in Section 4
have reached the predefined goal for this
computation.


References 

1. Barnes, J.G.P. (1989) Programming in
Ada, Third Edition, Addison-Wesley
Publishing, Reading, Massachusetts.

2. Coffman, E.G. and Denning, P.J. (1973)
Operating System Theory, Prentice-Hall,
Inc., Englewood Cliffs, New Jersey.

3. Hua, L.K. (1982) Introduction to Number
Theory (English Translation),
Springer-Verlag, New York.

4. Huang, K. (1965) Statistical Mechanics,
John Wiley & Sons, Inc., New York.

5. IMSL Library (1984) FORTRAN
Subroutines for Mathematics and
Statistics, User’s Manual, IMSL, Inc.,
Houston, Texas.

6. Trivedi, K.S. (1982) Probability &
Statistics with Reliability, Queuing, and
Computer Science Applications,
Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

7. VAX Ada (1985) Language Reference
Manual, Digital Equipment Corporation,
Maynard, Massachusetts.




AD-P007  189  FGP  Machine  455 


FGP:  Using  Statistics  to  Drive  an  Expert  Database 

Scott  Fertig 

Department of Computer Science
Box  2158,  Yale  Station 
Yale  University 

New  Haven,  Connecticut  06520-2158 




ABSTRACT 

We describe the FGP machine, which uses similarity-based
retrieval and “simulated speculation” to convert pools of
data directly into quasi-expert advice. The central operation
is the retrieval of a small set of records similar to
a partially-instantiated new record. The system uses two
statistical techniques to improve on the standard Euclidean
measure for calculating distance between two records, each
represented as a vector of features. One is a facility to
automatically weight the importance of features, which adds to
or subtracts from those features’ contribution to the overall
distance score. The other is a means for separating the most
relevant records from the rest by finding a natural break in
the ordering of the records by distance from the input. We
explain the role these techniques play in the overall operation
of the system in the next section; the algorithms used
for the calculations are described in the appendices.

1  INTRODUCTION 

The program we’ve built is named the FGP machine, after
its basic operations: fetch, generalize and project. We
imagine the FGP machine’s database as a collection of
regions in space (cf. the standard vector space text-retrieval
model). Each element of the database corresponds to some
region. Nearby regions correspond to nearby cases. When
presented with an inquiry, the machine’s basic task is to add
to the database a new region corresponding to the inquiry.
Stationing itself on top of this new region (so to speak),
the machine then looks around and reports the identities
of the nearby regions; these will correspond to elements of
the database that are nearby to, in other words closely
related to, the subject of the inquiry. We can then inspect this
list of nearby regions and “generalize”: determine which
attributes tend to be shared in common by all or by most of
them. We can guess that these common attributes are likely
to hold true for the case being described in the inquiry as
well.

Having reached whatever conclusions seem reasonable,
the machine may now indulge in a bit of simulated
speculation. Temporarily turning aside from the inquiry in hand, it
focusses on any “evocative possibilities” that may have
suggested themselves during the examination of nearby regions.
An “evocative possibility” is a datum that might be true,
and that would be significant if it were. The calculation of
evocativeness is discussed in appendix A. The machine’s
interaction with the user represents a combination of fairly safe
conclusions, speculation experiments and the subsequent
investigation of resulting guesses. An example transcript is
shown in figure 1.

The system can operate interactively, but here it is
working in “commentary” mode: the user presents an entire case;
the system scans it element-by-element, offering comments.
This case initially seems malignant (note the early mention
of related cases with diagnoses of infiltrating ductal
carcinoma); the fact that the mass has not changed in density and
has no comet (contradicting the system’s guesses, which in
the nature of guesses will often be wrong) points in the other
direction (“cyst” and “fed” refer to benign diagnoses); but
further data, particularly the absence of a halo, tips the
balance, and the system guesses that this is a malignant mass.
This guess is correct, and the diagnosis was in fact
infiltrating ductal carcinoma. This transcript is driven by a small
collection of 67 cases, which is the only domain knowledge
provided.

2  THE  MODEL 

An FGP machine is defined in terms of a single kind of
data-object and three primitive operators. These define a virtual
machine in terms of which the system is programmed. We
summarize the essential points in the remainder of this
section; see [1] for further discussion.

Data-objects & databases.

FGP machines run off a database of a single type of
data-object, a feature tuple to which we refer generically as a $\tau$. A
$\tau$ consists basically of a list of attribute-value pairs; we give
examples below. An FGP database consists of an unordered
collection of $\tau$’s. A new case for inclusion in the database is
presented as a $\tau$, and a query is an incomplete $\tau$: a partial
list of attribute-value pairs, with a request that the system
fill in certain missing ones. We use $\mathcal{M}$ to represent an
unordered collection of $\tau$’s; an FGP database of stored cases
and paradigms is an $\mathcal{M}$. $\mathcal{L}$ is a list of $\tau$’s ordered on their
“closeness” to some other $\tau$; we explain below.

Primitive  operators. 

The three basic FGP operations are fetch, generalize and
project. They work as follows.

fetch maps a $\tau$ and an $\mathcal{M}$ to an $\mathcal{L}$: given a feature-tuple
and a database of feature-tuples, it produces an ordered list
of those feature-tuples in the database that are “closest to”
the $\tau$ mentioned in the query. It is this operation that makes
use of the statistical techniques that are detailed in the
appendices.

fetch uses a two-step procedure. First it calculates a
“distance” from the new point to every point in the database¹; all
cases further away than some parametrized threshold are
removed from further consideration. fetch’s calculation not
only takes into consideration the number of shared attributes
and their types, but also, in the context of a request to fill
in values for missing attributes, the “evocativeness” of each
with respect to the current goal: a more evocative feature
is one that recalls a group of cases with a more highly
focussed set of values for the goal. The evocativeness of an
attribute-value pair with respect to a goal attribute is
inversely proportional to the entropy (disorder) of the
distribution of values for the goal represented in the group of
cases returned by fetch. See appendix A for details on how
this value is calculated. Next fetch checks to see if there
exists a well-defined group of “close” points among those
remaining by performing a crude cluster analysis. Appendix B
describes the clustering algorithm. An ordered list of these
close points is returned as fetch’s value.

¹This calculation needn’t require that every $\tau$ in $\mathcal{M}$ be examined;
we can use a hash scheme to direct attention to $\tau$’s that occupy com-

User’s case description (left column):

(AGE 68)
(MASS-DENSITY ISODENSE)
(MASS-BORDER-COMPLETE? NO)
(MASS-TYPE-BORDER IRREGULAR)
(MASS-BORDER-DEFINED? NO)
(MASS-LOCATION UIL)
(MASS-SIZE-CHANGED? YES)
(MASS-DENSITY-CHANGED? NO)
(MASS-COMET? NO)
(MASS-HALO? NO)
(BACKGROUND-DENSITY MODERATE)
(SKIN-CHANGES NO)
(NIPPLE-INVERSION? NO)
(ADENOPATHY? NO)
(FAMILY-HISTORY-CANCER SISTER)
(PERSONAL-HISTORY-CANCER NO)

System’s commentary (right column):

Speculating: MASS-DENSITY-CHANGED?...
Guessing INCREASED - e.g.
case ((id 14) (age 46) (diagnosis CA-INF-DUCTAL))
case ((id 50) (age 70) (diagnosis CA-INF-DUCTAL))

Speculating: MASS-COMET?...
Guessing YES - e.g.
case 14
case ((id 40) (age 69) (diagnosis CA-INF-DUCTAL))

Speculating: BACKGROUND-DENSITY...
Guessing DENSE - e.g.
case ((id 21) (age 61) (diagnosis cyst))
case ((id 47) (age 45) (diagnosis fed))

Concluding
(ARCHITECTURAL-DISTORTION? NO)

Speculating: MALIGNANT?...
Guessing YES - e.g. cases (2 6 8)

Speculating: SKIN-CHANGES...
Guessing RETRACTION - e.g. cases (2 8 28)

Closest known cases:
(19) (YES) (CA-INF-DUCTAL)
(33) (YES) (CA-INF-DUCTAL)
(26) (YES) (CA-INF-DUCTAL)
(28) (YES) (CA-INF-DUCTAL)
(18) (YES) (CA)

YES has been concluded or guessed for MALIGNANT?

Speculating: DIAGNOSIS...
CA?
CA-INF-DUCTAL?

Figure 1: Transcript of an FGP machine operating in the domain of mammography. The user’s case description is in the left
column, the system’s commentary on the right.

generalize maps an $\mathcal{L}$ to a $\tau$: it takes an ordered list
of feature-tuples and compresses them into a single new
feature-tuple. The weightier a $\tau$ and the closer it is to the
top of the list, the larger the contribution its attribute-value
pairs make to the combined $\tau$ returned by generalize.
Suppose we query on the $\tau$ (name apple), and suppose that $\mathcal{M}$
holds one hundred individual apples, half red and half yellow;
a generalize operation over a list consisting of one hundred
apples, half red and half yellow, yields a single $\tau$ that might
look like ((name apple 100) (type fruit 100) (color (red 50)
(yellow 50)) . . . ).

project maps a $\tau$ to a $\tau$: given a feature-tuple it returns
a new tuple constructed from a subset of the features in
the original. While project is a purely syntactic operation,
it is used by higher level operations (see the discussion of
refocus in [1]) to change contexts; the system focusses on
those attributes and values that are evocative, temporarily
ignoring other information on hand.

2.1  THE  BASIC  CYCLE 

Given this three-instruction virtual machine, how does the system
operate? The basic cycle is two phase: (1) extend the current τ;
(2) choose a new current τ, and repeat. Step one is implemented by an
extend(τ) function that is defined in terms of fetch and generalize.
Step two is implemented by refocus, which is defined in terms of all
three.

To extend a τ (to discover new implications given our database of
cases and paradigms) we begin by executing the operation
generalize(fetch(M, τ)), where M is the database. If τ, for example,
describes a particular patient, fetch(M, τ) will return a list of
remembered τ’s that are close to (similar to or reminiscent of) this
particular patient; executing generalize over this list will produce
an amalgam of all these remembered cases. Any highly-focussed and
sufficiently-weighty values can be classified as conclusions: if the
memories examined by generalize mainly have a value of “blonde” for
attribute “hair-color”, say, the system will conclude that
(hair-color blonde) is likely to characterize this case as well. It
reports (hair-color blonde) to the user as a conclusion and augments
the current τ with this new attribute-value pair. The system attempts
to conclude any value turned up by the fetch-generalize combination
which hasn’t yet been seen in the context of the current query. Values
which contradict² are withdrawn; the user’s input always takes
precedence over system guesses. The extend operation is complete when
all values that can be concluded have been and all contradictions
removed.

parable subspaces

² As expected, two distinct values of a boolean-typed attribute always
contradict. System-concluded values of other types of attributes
contradict only if this information is specified in the attribute’s
distance metric. See [1] for more details.
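The extend step can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: `fetch` is replaced by exact matching on the attributes already known, a value is "concluded" when it appears in at least a fixed share of the fetched cases, and the withdrawal of contradictory values is not modelled.

```python
from collections import Counter

def fetch(database, query):
    """Toy stand-in for fetch: cases agreeing with every attribute in query."""
    return [c for c in database
            if all(c.get(a) == v for a, v in query.items())]

def extend(database, query, share=0.8):
    """Repeatedly add attribute values that dominate the fetched cases.

    Returns the extended tuple and the dict of system-concluded values.
    """
    current = dict(query)
    concluded = {}
    while True:
        cases = fetch(database, current)
        new = {}
        # Consider only attributes not yet present in the current tuple.
        for attribute in {a for c in cases for a in c} - set(current):
            counts = Counter(c[attribute] for c in cases if attribute in c)
            value, n = counts.most_common(1)[0]
            if n / len(cases) >= share:
                new[attribute] = value
        if not new:
            return current, concluded
        current.update(new)
        concluded.update(new)

database = [
    {"region": "north", "hair-color": "blonde", "eye-color": "blue"},
    {"region": "north", "hair-color": "blonde", "eye-color": "green"},
    {"region": "north", "hair-color": "blonde", "eye-color": "blue"},
    {"region": "north", "hair-color": "blonde", "eye-color": "brown"},
    {"region": "south", "hair-color": "brown", "eye-color": "brown"},
]
extended, concluded = extend(database, {"region": "north"})
```

With this hypothetical database the only conclusion is (hair-color blonde); eye colour is too evenly spread to be concluded.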

SIMULATED  SPECULATION 

Refocus is then invoked over the extended τ. Its role is to examine a
τ and refocus attention from this entire τ to one (possibly small and
conceivably unrepresentative) part of it. This element considered in
isolation may serve as a seed for a new set of inferences. We call
this process “simulated speculation.” refocus may choose zero, one or
many data points; each chosen data point becomes the current τ in
turn. The more evocative a data point with respect to the goal (in
other words, the more sharply-defined the cases nearby a τ consisting
only of that data point with respect to the goal attribute), the
likelier a target for refocus. The more sharply a data point stands
out from the pack, the likelier a refocus target; by assumption it
won’t stand out clearly enough to qualify as a conclusion, but there
are many intermediate shadings here.

Typically, the system will examine each of a small set of values
associated with a particular attribute whose value, if known, would
focus the search space considerably. The system performs the basic
fetch-generalize cycle on each of these seed-tuples and is left with a
set of regions in vector-space. One may be much closer to the original
query than the others and may therefore be mergeable with it. The
reader can see the system’s behavior during several refocus
experiments by examining the transcript shown in figure 1. refocus
first announces the attribute projected to, followed by any values
tentatively guessed as the result of the speculation experiment. It
then gives pointers to specific cases that both have this value and
also resemble the rest of the user’s input.

3  PERFORMANCE 

A version of the FGP machine was implemented in the T dialect of
Scheme. There are approximately 5000 lines of code spread among 10
modules.

We are encouraged by our initial tests of the system. Experiments were
conducted on case databases in three domains, one of which we discuss
here. This test involved a small database of patient records,
specifically descriptions of mammograms. There were originally 88
records in the database; 20 cases were reserved for testing and 67
were used to seed the system, spanning 13 possible diagnoses (one of
these 13 possibilities was the diagnosis normal, meaning no disease
present)³. The system was presented with the 20 test cases and asked
to judge if a malignant lesion was indicated and, if so, to determine
a specific diagnosis. As discussed above, the system would present a
short list (< 4) of possible diagnoses if unable to decide on one with
certainty.

The domain expert was the radiologist who had compiled the database⁴.
Working from the descriptions of the mammograms alone, he accurately
judged the malignancy of the test cases at the 68% level. The system
performed slightly worse at 63%. However, the system outperformed the
domain expert in producing a differential diagnosis, with the right
answer being stated outright or appearing in a short list of
possibilities (<= 3) 70% of the time, compared with the clinician’s
60% correct performance.

³ One record was thrown out because no diagnostic information was
included.

⁴ Dr. Paul Fisher of the Department of Diagnostic Radiology, Yale
University School of Medicine. Dr. Fisher’s clinical specialty is
mammography.
4 CONCLUSION

We have presented the main features of a methodology for extracting
expertise from case databases automatically. The FGP system’s
domain-independent similarity-based weighting and clustering
algorithms support retrieval of noisy and incomplete data, drive an
intelligent interface, and provide a mechanism for incremental
learning of concepts. Experiments with a portion of the National
Cancer Institute’s SEER tumor registry are beginning and should tell
us how well the architecture scales in the face of truly large
databases.

A CALCULATION OF EVOCATIVENESS

The calculation of evocativeness attempts to determine how strongly a
presenting case τ brings to mind a value for the current goal. What we
would really like to measure is how much information about the goal we
gain from τ. Does τ strongly suggest only one goal value, or does it
bring to mind ten goal values which are equally likely? One way to
measure this information content I is to relate it to the entropy D of
a probability distribution.

Let the total number of goal values found in the top-cluster Q be T.
Assuming that the probability $p_i$ of goal value i being correct is
proportional to the number of times $n_i$ it occurs in the
top-cluster, we can calculate the entropy (disorder) of the
distribution:

$$D = -\sum_i p_i \ln p_i, \qquad p_i = \frac{n_i}{T}.$$

The entropy function D ranges from a value of 0, occurring when only
one goal value is represented, to ln N, where N represents the total
number of possible values for the goal in the database M. We can scale
the entropy to range from 0 to 1 simply by dividing by ln N:

$$S = \frac{D}{\ln N}.$$

We can further adjust the scale so that a scaled-entropy of 0
corresponds to the maximum evocativeness allowed by the system, while
a scaled-entropy of 1 corresponds to the minimum evocativeness allowed
by the system. By setting the endpoints of the scale far enough apart,
we can coerce a particular evocativeness value to an integer without
meaningfully reducing precision, and therefore avoid the cost of
storing and calculating with floating point numbers. This integer is
the final evocativeness number E used by the FGP machine.

$$E = S \cdot (\text{*min-evoc*}) + (1 - S) \cdot (\text{*max-evoc*})$$

B CLUSTERING ALGORITHM

fetch maps a τ and an M to an L: given a feature-tuple and a database
of feature-tuples, it produces an ordered list of those feature-tuples
in the database that are “closest to” the τ mentioned in the query. To
do this, fetch needs an algorithm to cluster the values returned by
the distance calculation. We use an algorithm developed by Mitchell
Sklar that is efficient and that experience has shown performs
reasonably well. The description of the algorithm that follows is
taken from Sklar’s medical school thesis [2].

(We can use a routine) CLUSTER to find a natural break point in a list
of cases, dividing that list into a “close” group C and a “distant”
group D. Referring to a list of numerical distances, the algorithm
attempts to partition the list into two groups $x_i$ and $y_j$ such
that the sum of the squared deviations within the groups is locally
minimized. That is, CLUSTER attempts to find a local minimum for

$$\sum_{i=1}^{m} (x_i - \bar{x})^2 + \sum_{j=1}^{n} (y_j - \bar{y})^2.$$

Computationally, however, this calculation is inefficient. Instead, we
can note that if $\mu$ represents the mean value of all the distances
$x_i$ and $y_j$ combined, then

$$\sum_{i=1}^{m} (x_i - \mu)^2 = \sum_{i=1}^{m} (x_i - \bar{x})^2 + m(\bar{x} - \mu)^2 = \sum_{i=1}^{m} (x_i - \bar{x})^2 + \frac{mn^2}{(m+n)^2}(\bar{x} - \bar{y})^2.$$

Similarly,

$$\sum_{j=1}^{n} (y_j - \mu)^2 = \sum_{j=1}^{n} (y_j - \bar{y})^2 + \frac{m^2 n}{(m+n)^2}(\bar{x} - \bar{y})^2.$$

Combining these results gives

$$\sum_{i=1}^{m} (x_i - \mu)^2 + \sum_{j=1}^{n} (y_j - \mu)^2 = \sum_{i=1}^{m} (x_i - \bar{x})^2 + \sum_{j=1}^{n} (y_j - \bar{y})^2 + \frac{mn}{m+n}(\bar{x} - \bar{y})^2.$$

Since the quantity on the left is constant for the given list of
distances no matter what the partition, we can minimize the total
intra-group summed squared deviations simply by choosing our partition
to find the first maximum for the last quantity,

$$\frac{mn}{m+n}(\bar{x} - \bar{y})^2.$$

This will give us the first clean break in the distance list.
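A minimal Python sketch of the break-point search follows. It assumes the distances are sorted in increasing order and scans the possible cut positions, stopping at the first position where the quantity mn/(m+n) times the squared difference of group means stops increasing. The function name and the handling of very short lists are assumptions made here, not details from Sklar's thesis.

```python
def cluster(distances):
    """Split distances at the first local maximum of
    m*n/(m+n) * (mean_close - mean_distant)**2.

    Returns (close, distant) as two lists in increasing order.
    """
    xs = sorted(distances)
    total = sum(xs)
    best_cut, prev_score = None, None
    running = 0.0
    for m in range(1, len(xs)):
        running += xs[m - 1]            # sum of the m closest distances
        n = len(xs) - m
        gap = running / m - (total - running) / n
        score = m * n / (m + n) * gap * gap
        if prev_score is not None and score < prev_score:
            break                        # previous cut was the first maximum
        best_cut, prev_score = m, score
    if best_cut is None:                 # fewer than two distances
        return xs, []
    return xs[:best_cut], xs[best_cut:]
```

For the distances 1, 2, 3, 10, 11, 12 the score rises through the first three cut positions and falls at the fourth, so the list is split between 3 and 10.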
1 Introduction

Statements of statistical modeling objectives are often sufficiently
formal as to provide a basis for construction of precise database
queries. Specific examples may be found in section 3 below; a form of
query commonly encountered in the analysis of longitudinal data is:
“obtain all ordered p-tuples of repeated measurements subsequent to
(or prior to) the occurrence of a certain event.” For certain kinds of
modeling, there may be a further proviso to the effect that the
measurements selected should have been obtained at certain regular
intervals. Depending on the nature of the study under consideration,
resolution of such queries may entail problems of

• spacing verification: determining that the components of a long
candidate vector are approximately equally spaced in time, and taking
proper action in the presence of irregularities

• synchronization: although the data in general may be obtained in a
very regular fashion, measurements of interest to the analyst may not
be synchronized from subject to subject; instead, the “origin” of the
time-course in a variable of interest may be subject-specific
(depending, for example, on the time of an exposure event)

• attribute linkage: the analyst’s interest in the value of a given
attribute at time t may depend on the fact that t was the first time
at which some other attribute attained a certain value.
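Two of the problems listed above, spacing verification and synchronization, can be made concrete with a short Python sketch. The function names, the target gap of 180 days and the tolerance of 20 days are illustrative assumptions; the visit dates are taken from the example relation in section 4.

```python
def spacing_ok(times, target_gap, tolerance):
    """True if every consecutive gap is within tolerance of target_gap."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return all(abs(g - target_gap) <= tolerance for g in gaps)

def resynchronize(times, origin):
    """Express observation times relative to a subject-specific origin."""
    return [t - origin for t in times]

regular = [8861, 9049, 9238, 9412]      # gaps of 188, 189 and 174 days
irregular = [9597, 9772, 10163, 10346]  # contains a gap of 391 days
```

Checking `spacing_ok(regular, 180, 20)` accepts the first series and rejects the second, and `resynchronize` shifts a series so that a chosen event date becomes time zero.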

Such an enumeration of problems arising in data manipulations
preparatory to longitudinal analysis is hardly exhaustive, but may
serve to convince the reader that no direct means to address these
problems exists in standard facilities for data manipulation connected
with statistical analysis environments (the SAS data step, S matrix
manipulation facilities, relational databases). Consequently, users of
these environments who wish to analyze actual longitudinal data are
often engaged in detailed programming tasks to extract and format data
with the required longitudinal properties. There is no question that
the environments are adequate to support such programming, but the
programming itself is expensive, prone to error, and often thrown
away.

Our objective in this report is to consider how to reduce programming
burdens encountered in manipulating data for longitudinal analyses.
Clearly, part of the burden will depend on the form of the permanent
data store, and we propose a “longitudinal relation” as a format for
permanent representation of observations obtained in longitudinal
studies. Our major concern, however, is the formulation of a
programmable query idiom which is in close correspondence to the
statement of the modeling objective. This formulation has not been
achieved, but we will discuss a working function on longitudinal
relations which solves some problems of interest. A satisfactory
interface to this function may require language concepts and tools not
accessible to the statistical user of S.

All of the programming related to this essentially conceptual
investigation is carried out in S. The ultimate realization of the
objectives explored here would likely be implemented in some other
language or database function; our purpose here is to establish
broader features of the problem and possible solutions.
2 Longitudinal Data

The basic data structure we are concerned with is derivable from
“panel studies”, “follow-up cohort studies”, and “repeated-measures
studies”, though some of these terms may suggest aspects of regularity
or data balance which we do not assume. There are $I$ subjects
presenting for observations repeated in time. For the moment, confine
attention to the case where the observation is scalar-valued, on the
variable $X$. Subject $i$ provides $n_i$ measurements on $X$; $n_i$
may differ from subject to subject. $T_i$ is the $n_i$-vector of
unique times of observation, measured in some convenient scale from
some common origin; $\{T_i\}$ is the set of elements of $T_i$. For
$i \in 1, \ldots, I$, $t \in \{T_i\}$, we denote by $X_{it}$ the value
of measurement $X$ obtained on subject $i$ at time $t$.

Let $\mathcal{I} = \{1, \ldots, I\}$ and
$\mathcal{T} = \bigcup_i \{T_i\}$. Then
$\mathcal{V} = \mathcal{I} \times \mathcal{T}$ is a natural index set
for the longitudinal data $X_{it}$. The suitability of this index set
to actual use for organization of the data will depend on the
distribution of inter-visit gaps and on the variation of $n_i$ with
$i$. In the case of “equidistant” ($|T_{ij} - T_{ik}| = c$ for
adjacent times $j$, $k$, all $i$), “balanced” ($n_i = n$, all $i$),
“complete” data (no missing observations), every point in
$\mathcal{V}$ corresponds to a unique data point. If only the
“equidistant” condition is dropped,
$\mathcal{I} \times \{1, \ldots, n\}$ may be used to index both $X$
and the set of observation times.

In practice, $X$ is vector-valued, and the set of components of $X$
(and of course the values of these components) may differ from time to
time (within individual) and from individual to individual.
Furthermore, equidistance and balance are rarely achieved in practice.
Therefore the simple indexing schemes just discussed will be useful
only if considerable sparseness is tolerated.

3 Analysis of longitudinal data

We mention a few analytical activities relevant to data analysis of
such studies. We assume throughout that time is measured in units from
an origin common to all subjects.


3.1  Conditional  autoregression 

Let time be measured in units. Adopt the model

$$Y_{it} = \beta_0 + X_{i,t-1}'\gamma + \sum_{k=1}^{p} \beta_k Y_{i,t-k} + \epsilon_{it};$$

$X_i$ (and $\gamma$) may be vector-valued, and some components of
$X_i$ may be time-dependent. We refer to this model as autoregressive
with order $p$, AR($p$). To fit the model, the ordered $(p+1)$-tuple
of equidistant measurements on $Y_i$ must be obtained and collated
with the appropriate elements of $X_i$ to establish the
contribution(s) of subject $i$ to the outcome vector and predictor
matrix to be submitted to a regression procedure. If the subject
presents more than $p+1$ measurements, it is possible that several
$(p+1)$-tuples may be suitable for the analysis, and all must be
obtained. See Muñoz et al. (1988), for example and further references.
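The extraction of suitable (p+1)-tuples can be sketched in Python as follows. This is an illustration under assumptions: only runs of consecutive observations are considered, "equidistant" is interpreted as every gap lying strictly inside a window, and the function name and arguments are hypothetical.

```python
def ar_tuples(times, values, p, min_gap, max_gap):
    """Return every run of p+1 consecutive observations whose successive
    gaps all lie strictly between min_gap and max_gap."""
    runs = []
    for start in range(len(times) - p):
        idx = range(start, start + p + 1)
        gaps = [times[j + 1] - times[j] for j in idx[:-1]]
        if all(min_gap < g < max_gap for g in gaps):
            runs.append(tuple(values[j] for j in idx))
    return runs

# Hypothetical subject with four roughly six-monthly visits.
times = [8861, 9049, 9238, 9412]
values = [1021, 785, 915, 848]
triples = ar_tuples(times, values, p=2, min_gap=160, max_gap=200)
```

A subject with four regularly spaced visits contributes two overlapping triples to an AR(2) fit, which is the "several tuples" situation described above.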

3.2  Proportional  hazards  regression 

Writing $x(t)$ for the vector of time-dependent covariates at time
$t$, the model for the hazard of an event is
$h(t \mid x(t)) = h(t \mid x(t) = 0)\exp[x(t)\beta]$. We consider the
implementation (agreg) supplied by Therneau (1991) to STATLIB for use
in S. Partition $[0, T_{i,n_i})$ into disjoint intervals on each of
which $x_i(t)$ is constant, and denote the $j$th such interval by
$[s_{ij}, u_{ij})$. We may assume that there are $n_i - 1$ such
intervals without loss of generality. The required data structure for
the contribution from individual $i$ is
$(s_{ij}, u_{ij}, \delta(u_{ij}), x_i(t_{ij}))$, where $\delta(u)$ is
the indicator of “event occurs at time $u$”,
$s_{ij} < t_{ij} < u_{ij}$, and the vector $x_i(t)$ is constant on
$[s_{ij}, u_{ij})$.
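The interval structure described above can be sketched in Python. The sketch assumes that the covariate value observed at the start of each interval holds throughout it and that an event, if any, coincides with an observation time; the function name and argument layout are hypothetical and do not reproduce the agreg interface.

```python
def counting_process_rows(times, covariates, event_time=None):
    """Build (start, stop, event, covariate) rows for one subject.

    times:      sorted observation times (n values give n - 1 intervals).
    covariates: covariate value observed at each time; the value at the
                start of an interval is assumed to hold throughout it.
    event_time: time of the event, or None if no event was observed.
    """
    rows = []
    for j in range(len(times) - 1):
        start, stop = times[j], times[j + 1]
        event = int(event_time is not None and event_time == stop)
        rows.append((start, stop, event, covariates[j]))
    return rows

rows = counting_process_rows([8861, 9049, 9238, 9412],
                             [1021, 785, 915, 848],
                             event_time=9412)
```

Four observation times yield three rows, and only the final row carries the event indicator.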

3.3  Discussion 

These examples are emblematic, and the data manipulation activities
entailed by these particular problems arise in other settings. We
identify a few of the broad features of the “required” data.

• The “time-gap” separating observations is a crucial datum; sequences
of gaps may play an important role in identifying analyzable
contributions.

• Observations on different attributes (e.g., outcome and predictor
variables) may need to be “linked” (for extraction purposes) with
regard to their time of measurement. The nature of linkage may be
complicated, not limited to e.g., “simultaneity”.

• The structure of the contribution need not be a function of elapsed
time only, but may depend on the values taken by time-dependent
variables; the time at which such variables take on certain critical
values may constitute a subject-specific “origin”.

4  Longitudinal  relation 

The application of relational database techniques to the management of
longitudinal data may take various forms. We define a longitudinal
relation to be a relation comprising observations as described in
section 2 above, with the ordered pair (i, t) as the compound key for
the relation. As an example, we provide an extract from a hypothetical
cohort study of HIV infection; data on age, markers of infection, and
infection status are recorded.

id      date    age    cd4    cd8   HIV
70328    9597   36.8    338   1222   +
70328    9772   37.3    542    617   +
70328   10163   38.3    312   1656   +
70328   10346   38.8    270   1645   +
70319    8861   38.3   1021    688   -
70319    9049   38.8    785    447   +
70319    9238   39.3    915    826   +
70319    9412   39.8    848    915   +
The longitudinal relation is not the inevitable form of organization
for longitudinal data. Often, “flat files” are constructed at certain
stages in the study, with the file comprehending all observations
falling in a certain interval. These files are then subject to
frequent merging and subsetting.

A unified longitudinal relation may be awkward for the combination of
attributes varying smoothly in time and discrete attributes. For
example, the HIV attribute above has values in each of the rows for
subject i, but these $n_i$ data items record the single piece of
information: “first date at which infection was observed.”

5 Processing longitudinal relations

Our approach to the use of longitudinal relations for extracting data
for statistical analysis will be illustrated for the case of the AR(1)
model (see section 3.1). The longitudinal relation is readily
implemented in an S matrix bearing attribute-names as column-names.
For a very simple AR(1) model for change in cd4, we require
equidistant pairs of observations on this attribute, lagged at
approximately six month intervals. As a covariate, we employ cd8
measured at the lagged time. A possible solution is the following
longitudinal relation:

id      date    cd4    cd8   date+   cd4+   cd8+
70328    9597    338   1222    9772    514    617
70328   10163    312   1656   10346    270   1645
70319    8861   1021    688    9049    785    447
70319    9049    785    447    9238    915    826
70319    9238    915    826    9412    848    915

We have adopted the convention that var+ is the value of var at the
“next” time as required by the spacing scheme. Such suffixing is
naturally iterative. Having obtained such a longitudinal relation in
an S matrix, say arlld, the S command

lsfit(  arlld[,c("cd4","cd8")] , 
arlld[,"cd4+"] ) 

is  one  way  to  estimate  the  parameters  of  the  model 
of  interest. 

We have implemented an S function, pairgen, to carry out this process.
The user must supply an “admission function”, which operates on
criterial variables (typically the “time” component of the key is a
criterial variable) to indicate which rows of the input longitudinal
relation should be combined to produce an output longitudinal relation
whose rows possess data elements satisfying certain time-dependent
conditions. In the present example, the


462  V.  Carey,  Y.  He,  and  A.  Munoz 


admission  function  specified  that  a  pair  of  observa¬ 
tions  is  to  be  admitted  to  the  output  relation  only 
if  the  times  of  the  observations  are  separated  by 
more  than  160  and  fewer  than  200  days. 

It may be worth noting that, at least for subject 70319, the output
relation attributes (date, date+) are a partition of that subject’s
observation time-line as would be needed for the agreg analysis
mentioned in section 3.2. It is straightforward to generate such
partitions from longitudinal relations using trivial admission
functions.

Because  the  pairgen  function  considers  a  vector 
of  criterial  variables,  it  addresses  the  problems  of 
synchronization  and  attribute  linkage  mentioned  in 
the  introduction:  the  condition  for  admission  of  an 
observation  to  the  output  relation  may  be  specified 
in  terms  of  arbitrarily  many  attributes  in  the  input 
relation. 
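The behaviour described in this section can be imitated with a short Python sketch. It is not the authors' S code: the representation of the relation as a list of tuples, the requirement that rows be sorted by date within subject, and the function signature are assumptions made here. The data are the example relation from section 4, and the second admission function follows the one described in the appendix.

```python
from itertools import combinations

COLUMNS = ("id", "date", "age", "cd4", "cd8", "HIV")
RELATION = [
    (70328, 9597, 36.8, 338, 1222, "+"),
    (70328, 9772, 37.3, 542, 617, "+"),
    (70328, 10163, 38.3, 312, 1656, "+"),
    (70328, 10346, 38.8, 270, 1645, "+"),
    (70319, 8861, 38.3, 1021, 688, "-"),
    (70319, 9049, 38.8, 785, 447, "+"),
    (70319, 9238, 39.3, 915, 826, "+"),
    (70319, 9412, 39.8, 848, 915, "+"),
]

def pairgen(relation, columns, criterial, admit):
    """Join each earlier row to each later row of the same subject when
    admit() accepts the criterial variables of the pair.

    Rows are assumed to be sorted by date within subject.
    """
    rows = [dict(zip(columns, r)) for r in relation]
    out = []
    for first, second in combinations(rows, 2):
        if first["id"] != second["id"]:
            continue
        # Criterial variables at the earlier time, and suffixed with "+"
        # at the later time, as in the var+ naming convention.
        x = {name: first[name] for name in criterial}
        x.update({name + "+": second[name] for name in criterial})
        if admit(x):
            pair = dict(first)
            pair.update({k + "+": v for k, v in second.items() if k != "id"})
            out.append(pair)
    return out

def six_months(x):
    return 160 < x["date+"] - x["date"] < 200

def six_months_low_cd8(x):
    return six_months(x) and x["cd8+"] < 800

pairs = pairgen(RELATION, COLUMNS, ["date"], six_months)
low = pairgen(RELATION, COLUMNS, ["date", "cd8"], six_months_low_cd8)
```

With the spacing-only admission function, five pairs are admitted; adding the condition on cd8 at the later time leaves two.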


6  Discussion 

The problem of effecting transformations of data from permanent
storage (“archives” or “databases”) into the structures required by
particular analytical procedures is often addressed by programs in
high-level languages, in isolation from actual statistical
procedure-invocation. Programmed statistical procedures are used
essentially as targets: a procedure is selected in accordance with
modeling objectives, the input requirements of the procedure are
ascertained, and then data in the permanent store are extracted and
transformed in accordance with the input requirements.

This extraction and transformation process is highly error-prone and
inspires too much “throwaway” programming effort. We propose that
classes of modeling objectives be identified, that the data structures
needed in pursuit of these objectives be identified, and that data
management systems be equipped with high-level functions delivering
these data structures. We have focused attention on modeling
objectives related to longitudinal data analysis, and have identified
some data structure features which must often be obtained in
performing such analyses. While it is obviously feasible for analysts
to develop ad hoc extraction and formatting procedures to facilitate
longitudinal analysis, we have investigated the possibility of
implementing a systematic approach, and our results suggest that
further effort may be profitable. The chief difficulty we face in
making this functionality widely usable is in obtaining an interface
with a natural syntax. The appendix presents the current interface to
the pairgen function. This function generates regularly spaced
longitudinal pairs and partitions observation time-lines in a useful
fashion. A similar function might be developed in the SAS data step,
based on the LAGn functions, or in the programming system of an RDBMS.


7  Appendix 

The pairgen function takes three arguments: a longitudinal relation, a
list of criterial variables, and an admission function. The list of
criterial variables is a subvector of the “attribute-names” vector of
the longitudinal relation. The admission function must be written in
terms of elements of a vector of criterial variables extended to
embrace the naming convention described in section 5. To extract from
the first relation in section 4 the pairs representing lags
approximately six months in length and ending with cd8 values lower
than 800, the criterial variables are identified in the vector
c("date","cd8"), and the following admission function might be used:

    function(x)
    {
    x["date+"] - x["date"] > 160 &
    x["date+"] - x["date"] < 200 &
    x["cd8+"] < 800
    }
Abstract

A model for the relation between multivariate fourth-order central
moments of a set of variables and the marginal kurtoses and
covariances among these variables is used to produce an estimator for
covariance structure analysis that is asymptotically efficient and
yields an asymptotic goodness of fit test of the covariance structure
while substantially reducing the computations. When the kurtoses of
the variables are equal, the method reduces to one based on
multivariate elliptical distribution theory, and, when there is no
excess kurtosis, to one based on multivariate normal distribution
theory.

Introduction 

In covariance structure analysis, the $p \times p$ population positive
definite covariance matrix $\Sigma$ is hypothesized to be a function
of a $q \times 1$ vector $\theta \in \Theta$ of more basic parameters,
$\Sigma = \Sigma(\theta)$. An asymptotically efficient
distribution-free estimator $\hat{\theta}$ of $\theta$ can be obtained
by minimizing the quadratic discrepancy function

$$F = (s - \sigma(\theta))'\hat{\Gamma}^{-1}(s - \sigma(\theta)), \tag{1}$$

where $\sigma(\theta) = \mathrm{vecs}(\Sigma(\theta))$,
$s = \mathrm{vecs}(S)$, and $\hat{\Gamma}$ is a $(p^* \times p^*)$
weight matrix converging in probability to a positive definite matrix
$\Gamma$, the asymptotic covariance matrix of $s$. Here, $S$ is the
usual sample covariance matrix based on a sample of size $n + 1$,
$p^* = p(p+1)/2$, and $\mathrm{vecs}(S)$ is the $p^* \times 1$ column
vector formed from the nonduplicated elements of $S$. Under the null
hypothesis, at the minimum of (1),

$$n\hat{F} = n(s - \sigma(\hat{\theta}))'\hat{\Gamma}^{-1}(s - \sigma(\hat{\theta})) \sim \chi^2_{(p^*-q)} \tag{2}$$

is asymptotically distributed as a central $\chi^2$ variate with
$(p^* - q)$ degrees of freedom, and the asymptotic covariance matrix
of the estimator is given by

$$\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{D} N[0, (\Delta'\Gamma^{-1}\Delta)^{-1}], \tag{3}$$

where $\Delta = \partial\sigma(\theta)/\partial\theta'$ is evaluated
at the true value $\theta = \theta_0$. Appropriate regularity
conditions for (2) and (3) to hold are given by Satorra (1989) and
others.

Supported in part by USPHS grants DA0017 and DA01070. This manuscript
is based on a paper presented at Interface ’91 (Seattle, April 1991).
Address reprint requests to P. M. Bentler, Department of Psychology,
UCLA, Los Angeles, CA 90024-1563.

As shown by Browne (1982), the elements of $\Gamma$ are given in the
distribution-free case by

$$\gamma_{ij,kl} = \sigma_{ijkl} - \sigma_{ij}\sigma_{kl}, \tag{4}$$

where $\sigma_{ij} = E(X_i - \mu_i)(X_j - \mu_j)$ and
$\sigma_{ijkl} = E(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l)$
for random variables $X = X_1, \ldots, X_p$ having means $\mu_i$.
Estimating the mixed fourth-order moments $\sigma_{ijkl}$ requires a
lot of computer time and storage, and the moment estimator tends to be
unstable in small samples. Hence there has been a search for
alternatives to (1)-(4) that are more practical, yet retain asymptotic
optimality. The two major approaches seek to substitute a
computationally simpler and more stable estimator for $\sigma_{ijkl}$
in (4).
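To make the computational burden concrete, the following Python sketch estimates the distribution-free weight matrix from raw data using sample analogues of the second- and fourth-order moments in Browne's formula. It is an illustration only: the function name is hypothetical, and moments are computed with divisor n rather than any bias-corrected form.

```python
from itertools import combinations_with_replacement

def distribution_free_gamma(data):
    """Estimate gamma[ij,kl] = s_ijkl - s_ij * s_kl from raw data rows.

    Returns the list of nonduplicated index pairs (i, j) and the
    p* x p* matrix as nested lists, with p* = p(p+1)/2.
    """
    n, p = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(p)]
    dev = [[row[i] - means[i] for i in range(p)] for row in data]

    def moment(*idx):
        # Average product of deviations over the listed variables.
        total = 0.0
        for d in dev:
            prod = 1.0
            for i in idx:
                prod *= d[i]
            total += prod
        return total / n

    pairs = list(combinations_with_replacement(range(p), 2))
    gamma = [[moment(i, j, k, l) - moment(i, j) * moment(k, l)
              for (k, l) in pairs] for (i, j) in pairs]
    return pairs, gamma

data = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 6)]
pairs, gamma = distribution_free_gamma(data)
```

Even for p = 2 the matrix is 3 by 3 and every entry needs a fourth-order moment; the number of entries grows roughly as the fourth power of p, which is the cost the alternatives discussed next try to avoid.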
Simple Efficient Models

One general alternative approach has been to determine conditions
under which the matrix with elements

$$\gamma_{ij,kl} = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk} \tag{5}$$

can substitute for (4) with no loss of efficiency. This is the form
that (4) would take if

$$\sigma_{ijkl} = \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}, \tag{6}$$

which holds when the variables $X$ are multivariate normally
distributed, or have no “excess” kurtosis. However, (5) can be used
without loss of efficiency with nonnormal data, provided that some
conditions on the model and parameters are met. The relevant
asymptotic robustness theory has been the object of intensive recent
research (e.g., Amemiya & Anderson, 1990; Browne & Shapiro, 1988;
Satorra & Bentler, 1990). Although conditions for asymptotic
robustness have been developed, in general they are difficult to
verify and apply in practice.


Another approach has been to determine conditions under which
relatively simple extensions of (6) would hold. As noted by Browne
(1982) and Bentler (1983), under multivariate elliptical distributions
(e.g., Fang & Anderson, 1990; Shapiro & Browne, 1987)

$$\sigma_{ijkl} = \eta(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}), \tag{7}$$

where $\eta = \sigma_{iiii}/3\sigma_{ii}^2$ represents the common
marginal kurtosis of the $i = 1, \ldots, p$ variables. Hence, (1)-(3)
can be applied optimally if a consistent estimator $\hat{\eta}$ of
$\eta$ is used in (7) and hence (4). Such estimators are readily
available and thus elliptical theory is implemented in standard
computer programs (e.g., Bentler, 1989). When $\eta = 1$, the
multivariate normal form of $\Gamma$ in (5) applies, and computations
are simpler still.


In practice, the assumption of homogeneous kurtosis for
all p variables, as made under both normal and
elliptical theories, is excessively strong. Thus, Kano,
Berkane, and Bentler (1990) proposed the structure

$\sigma_{ijkl} = (a_{ij}a_{kl})\sigma_{ij}\sigma_{kl} + (a_{ik}a_{jl})\sigma_{ik}\sigma_{jl} + (a_{il}a_{jk})\sigma_{il}\sigma_{jk}$,  (8)

where $a_{ij} = a_{ji}$ are parameters arbitrarily selected to
assure that $\Gamma$ with elements given in (4) is positive
definite. They proved that use of consistent estimators
$\hat a_{ij}$ with (8), and in minimizing F in (1), yielded the
asymptotic goodness of fit test in (2) and the
minimum variance estimator with covariance matrix
given in (3). In practice, they suggested using the
structure

$a_{ij} = (\eta_i + \eta_j)/2$,  (9)

where $\eta_i^2 = \sigma_{iiii}/(3\sigma_{ii}^2)$. Estimates of p marginal
kurtoses, along with covariances, are needed to implement
(8). If the marginal kurtoses $\eta_i$ are equal for all
variables, (9) with (8) reduces to (7). Thus this
methodology generalizes the approach based on elliptical
theory, while requiring no heavier computations.

A  Simple  Kurtosis  Structure 


A limitation of (8) was noted by Kano, Berkane, and
Bentler (1990). Letting $C = A * \Sigma$ (i.e., $c_{ij} = a_{ij}\sigma_{ij}$),
Kano et al. proved that a necessary condition for $\Gamma$ to
be positive definite is that C is positive definite and the
$\eta_i$ are all positive. While the latter condition is not
restrictive, if the $\eta_i$ are highly variable, the structure (9)
might not be consistent with a positive definite $\Gamma$.
Hence (8) would be an inappropriate kurtosis model.
For example, with p = 2, if $\eta_1 = 1$, $\eta_2 = 10$, and $\Sigma$
is the 2x2 correlation matrix with $\sigma_{12} = .6$, under
(9) C would not be positive definite. Here we give an
alternative structure for (8) that is more widely
applicable than that based on (9).


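Let $\eta_i^2 = \sigma_{iiii}/(3\sigma_{ii}^2)$ as before. In addition, let

$a_{ij} = (\eta_i\eta_j)^{1/2}$.  (10)

Then we have the following result: Under (10), the
matrix $C = A * \Sigma$ is positive definite. Clearly,
$c_{ij} = \eta_i^{1/2}\sigma_{ij}\eta_j^{1/2}$, i.e., $C = D\Sigma D$ where D is a diagonal
matrix with $d_{ii} = \eta_i^{1/2}$. Thus, since $\Sigma$ is positive
definite by assumption, so is C for any marginal kurtoses
of the variables.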
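
The two-variable counterexample can be checked numerically. The sketch below is illustrative only: it assumes that structure (9) takes $a_{ij}$ as the arithmetic mean of $\eta_i$ and $\eta_j$ and that structure (10) takes the geometric mean, and the function names are hypothetical.

```python
import numpy as np

def hadamard_C(eta, Sigma, rule):
    """Return C = A * Sigma (elementwise product) for a given rule for a_ij."""
    eta = np.asarray(eta, dtype=float)
    if rule == "arithmetic":      # assumed form of structure (9)
        A = (eta[:, None] + eta[None, :]) / 2.0
    elif rule == "geometric":     # assumed form of structure (10)
        A = np.sqrt(eta[:, None] * eta[None, :])
    else:
        raise ValueError(f"unknown rule: {rule}")
    return A * np.asarray(Sigma, dtype=float)

def is_positive_definite(M):
    """A symmetric matrix is positive definite iff all eigenvalues are > 0."""
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

# Example from the text: p = 2, eta_1 = 1, eta_2 = 10, correlation .6
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
eta = [1.0, 10.0]
C9 = hadamard_C(eta, Sigma, "arithmetic")
C10 = hadamard_C(eta, Sigma, "geometric")
print(is_positive_definite(C9), is_positive_definite(C10))
```

Under the arithmetic rule the off-diagonal element is 5.5 x .6 = 3.3, so the determinant is 10 - 10.89 < 0; under the geometric rule the determinant is 10 - 3.6 > 0.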


The structure (10) thus extends the applicability of the
Kano et al. (1990) theory to a wider range of nonnormal
distributions. In particular, (2) and (3) hold,
provided that (8) holds with (10). The counterexample
based on (9), given above, would not be a problem
when based on (10). The structure (10) also represents
a generalization of the elliptical structure (7). That is,
if $\eta_i = \eta_j$ for all variables, substitution of (10) into
(8) yields the structure (7). In turn, the normal theory
relation (6) is also a special case.

Function Simplification

Kano, Berkane, and Bentler (1990, eq. 11) showed how
the function (1) to be optimized can be simplified into
a computationally more efficient form under the kurtosis
structure (8). Under this structure, the function
specializes into a form that avoids computation of the
large ($p^* \times p^*$) weight matrix in (1). Because the
proposed structure (10) is used under (8), the same
function simplification as described by Kano et al. applies.
These authors also showed how a yet further simplification
is possible when the model $\sigma(\theta)$ meets a condition
of full scale invariance. A different type of simplification
is possible under the newly proposed structure
(10), if the model meets the ICSF assumption:

The covariance structure $\sigma(\theta)$ is said to be invariant
under a constant scaling factor (ICSF) if for any positive
number $\alpha$ and $\theta \in \Theta$, there exists a $\theta^* \in \Theta$ such
that $\alpha\sigma(\theta) = \sigma(\theta^*)$.

Under the ICSF assumption, Satorra and Bentler
(1986) and Shapiro and Browne (1987) showed that
$\sigma = \Delta d$ for some vector d. Then, under the kurtosis
model (8) with the relation (10), the general matrix $\Gamma$
defined in (4) can be written in the form

$\Gamma = 2K_p'(C \otimes C)K_p + cc' - \Delta dd'\Delta'$  (11)

where $K_p$ is a known matrix such that $\sigma = K_p'\mathrm{vec}(\Sigma)$,
and where $c = K_p'\mathrm{vec}(C)$. Shapiro (1986) obtained
the result that if $\Gamma$ could be expressed in the form
$\Gamma = W + \Delta G\Delta'$ for some symmetric matrix G, then under
some regularity conditions, at the minimum

$n(s - \sigma(\hat\theta))'W^{-1}(s - \sigma(\hat\theta)) \sim \chi^2_{(p^*-q)}$  (12)

and the estimator $\hat\theta$ is asymptotically efficient. In practice,
one uses a consistent estimator $\hat W$ of W in (12).

It is apparent that (11) is of the form required for (12)
to be applicable, with

$W = 2K_p'(C \otimes C)K_p + cc'$.  (13)

Some algebra can verify that the function (12) to be
minimized under (13) can be written as

$F = \frac{1}{2}\mathrm{tr}\{[S - \Sigma(\theta)]C^{-1}\}^2 - \delta\{\mathrm{tr}[S - \Sigma(\theta)]C^{-1}\}^2$  (14)

where $\delta = (2p + 4)^{-1}$. The advantage of minimizing
(14) rather than (1) is that matrices of much smaller
order are involved. Also, since (14) is a variant of the
form given by Bentler (1983, eq. 3.13) for estimation
under elliptical distributions, only minor modifications
to standard programs are needed to implement
estimation by minimizing (14).
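
The algebra behind (14) can be checked numerically. The sketch below is a minimal illustration: it assumes that W in (13) has elements $c_{ik}c_{jl} + c_{il}c_{jk} + c_{ij}c_{kl}$ over the nonduplicated index pairs $i \le j$, $k \le l$, and it compares the quadratic form in (12), without the factor n, to the trace form (14). The function names and the random test matrices are hypothetical.

```python
import numpy as np

def vech_pairs(p):
    """Index pairs (i, j), i <= j, of the nonduplicated elements."""
    return [(i, j) for i in range(p) for j in range(i, p)]

def weight_matrix(C):
    """Assumed elementwise form of W = 2K'(C x C)K + cc' in (13)."""
    pairs = vech_pairs(C.shape[0])
    W = np.empty((len(pairs), len(pairs)))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            W[a, b] = C[i, k] * C[j, l] + C[i, l] * C[j, k] + C[i, j] * C[k, l]
    return W

def quadratic_form(S, Sigma, C):
    """(s - sigma)' W^{-1} (s - sigma), using the large p* x p* matrix."""
    E = S - Sigma
    e = np.array([E[i, j] for i, j in vech_pairs(C.shape[0])])
    return float(e @ np.linalg.solve(weight_matrix(C), e))

def trace_form(S, Sigma, C):
    """The simplified function (14), which needs only p x p matrices."""
    p = C.shape[0]
    M = (S - Sigma) @ np.linalg.inv(C)
    delta = 1.0 / (2 * p + 4)
    return float(0.5 * np.trace(M @ M) - delta * np.trace(M) ** 2)

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p)); C = A @ A.T + p * np.eye(p)   # positive definite
B = rng.standard_normal((p, p)); S = B @ B.T
G = rng.standard_normal((p, p)); Sigma = G @ G.T
print(quadratic_form(S, Sigma, C), trace_form(S, Sigma, C))
```

For p = 4 the explicit weight matrix is already 10 x 10, while the trace form works with 4 x 4 matrices throughout, which is the practical point of the simplification.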
Abstract

In Monte Carlo simulation of multivariate
distributions, it is often helpful to use a general class of
distributions which share certain defining characteristics
but which allow controlled variation of other
characteristics. We show how multivariate kurtosis, as
measured by Mardia's coefficient, can be controlled
across the class of elliptically-contoured distributions.
This allows convenient assessment of the effects of
kurtosis on test power, robustness, or whatever the
Monte Carlo subject of interest. We illustrate the
method's utility by showing that common tests for
skewness are also very sensitive to kurtosis even in
non-skewed distributions.

1  Introduction 

Elliptically-contoured multivariate distributions are
those whose equal-density contours are ellipses
(bivariate case) or hyper-ellipses (for the p>2 case).
Multivariate normal distributions are special cases, and
elliptically-contoured distributions provide one approach
for organizing departures from multivariate normality.
See Chmielewski (1981) for a summary and review of
elliptically-contoured distributions and their contributions
to robustness studies. Johnson (1987, chapter 6) describes
an easily implemented approach for generating
elliptically-contoured distributions.

We are here concerned with controlling multivariate
kurtosis, as defined by Mardia (1970). For a p-variate
distribution, Mardia's multivariate kurtosis coefficient is

$\beta_{2,p} = E[(X - \mu)'\Sigma^{-1}(X - \mu)]^2$,

where $E(X) = \mu$ and $\mathrm{Cov}(X) = \Sigma$. The sample analog is
$b_{2,p} = n^{-1}\sum_{i=1}^{n}[(X_i - \bar X)'S^{-1}(X_i - \bar X)]^2$, with the
sample covariance matrix S taken to order n.
For the univariate case, $\beta_{2,1}$ and $b_{2,1}$ become the usual
univariate population and sample kurtosis coefficients.

Mardia (1970) also defines population and sample
multivariate skewness coefficients, $\beta_{1,p}$ and $b_{1,p}$. These
reduce to $(\sqrt{\beta_1})^2$ and $(\sqrt{b_1})^2$ for p=1.
Elliptically-contoured distributions are not skewed by any reasonable
skewness criterion; i.e., $\beta_{1,p} = 0$.

Largely following Johnson (1987, chapter 6), an
elliptically-contoured random vector Y can be generated as

$Y_{(p \times 1)} = R\,B_{(p \times p)}U_{(p \times 1)} + \mu_{(p \times 1)}$

where

R is a non-negative random variable with finite
variance;

U is a point uniformly distributed on the unit p
hypersphere;

R and U are independent;

B is a factorization of $M = BB'$, with M being
proportional to $\Sigma$.

In this scheme:

$E(Y) = \mu$, and

$\mathrm{Cov}(Y) = \Sigma = p^{-1}E(R^2)BB'$.

The special case of spherically-contoured distributions
arises when $B = aI_{(p \times p)}$. Thus, a generation scheme for
spherically-contoured distributions with $\mu = 0$ and
$\Sigma = p^{-1}E(R^2)I$ is

$X_{(p \times 1)} = R\,U_{(p \times 1)}$.




As  the  above  generation  schemes  imply,  any 
elliptically-contoured  random  vector  can  be  obtained  by 
an  affine  transformation  of  a  spherically-contoured 
random  vector. 

2  Controlling  Kurtosis 

Theorem 1: For spherically-contoured distributions
generated as $X = RU$, if $E[R^2]$ and $E[R^4]$ exist, then
$\beta_{2,p} = p^2E(R^4)/[E(R^2)]^2$.

Proof:

We first note that $E(U) = 0$ and $\mathrm{Cov}(U) = p^{-1}I$. [See
Mardia, Kent, and Bibby, 1979, p. 429, for both results.]
It follows that:

$E(X) = \mu = 0$, and

$\mathrm{Cov}(X) = \Sigma = E[RU(RU)'] = p^{-1}E(R^2)I$.

$\beta_{2,p} = E[(X - \mu)'\Sigma^{-1}(X - \mu)]^2 = E\{(RU)'[p^{-1}E(R^2)I]^{-1}(RU)\}^2$

$\beta_{2,p} = p^2E\{[R^2U'U]^2\}/[E(R^2)]^2$

Since $U'U = 1$ by definition of a point on a unit
p-hypersphere,

$\beta_{2,p} = p^2E(R^4)/[E(R^2)]^2$.

#

Theorem 2: For elliptically-contoured distributions
generated as $Y = RBU + \mu$, if $E[R^2]$ and $E[R^4]$ exist,
then $\beta_{2,p} = p^2E(R^4)/[E(R^2)]^2$.

Proof:

Mardia (1970) shows that if Y is an affine
transformation of X, then $\beta_{2,p}(Y) = \beta_{2,p}(X)$. Thus,
Theorem 1 also establishes Theorem 2.

#

In application, to generate an elliptically-contoured Y
with a target covariance matrix, $\Sigma$, establish kurtosis via
selection of the distribution on R. Set $M = p\Sigma/E(R^2)$, and
obtain B via a Cholesky decomposition of M.
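
The recipe above can be sketched in a few lines. This is a hedged illustration rather than the authors' program: the function name `sample_elliptical`, its arguments, and the example covariance matrix are hypothetical, and the uniform point on the hypersphere is obtained by normalizing a standard normal vector, which is one common device.

```python
import numpy as np

def sample_elliptical(n, mu, Sigma, r2_sampler, mean_r2, rng):
    """Draw n rows of Y = R B U + mu with Cov(Y) = Sigma.

    r2_sampler(rng, n) returns n draws of R^2; mean_r2 is E(R^2).
    """
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    p = mu.size
    Z = rng.standard_normal((n, p))
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # uniform on unit sphere
    R = np.sqrt(r2_sampler(rng, n))                    # radius variable
    B = np.linalg.cholesky(p * Sigma / mean_r2)        # M = p Sigma / E(R^2) = B B'
    return mu + (R[:, None] * U) @ B.T

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
# R^2 ~ Gamma(shape 3/2, scale 2), so E(R^2) = 3 (the p = 3 normal case)
Y = sample_elliptical(200_000, mu, Sigma,
                      lambda g, n: g.gamma(1.5, 2.0, size=n), 3.0, rng)
print(np.round(np.cov(Y, rowvar=False), 2))
```

Swapping in a different sampler for $R^2$ changes the kurtosis while the target mean and covariance stay fixed.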

Choice of the distribution on R, often called the
"radius" random variable, controls kurtosis and also controls
all higher-order even moments of the multivariate
distribution. Since for establishing kurtosis, the pertinent
moments are $E(R^2)$ and $E(R^4)$, it is often more
convenient to differentiate among elliptically-contoured
distributions in terms of differing distributions placed
on $R^2$.

$R^2 \sim \Gamma(p/2, 2)$ leads to multivariate normality.
Distributions which depart from normality in kurtosis but
still have unbounded support can be obtained by using
other gamma distributions for $R^2$. Distributions with
bounded support and known kurtosis can be obtained by
using a bounded-support univariate distribution on $R^2$,
such as a beta distribution. $\beta_{2,p}$ can be easily determined
so long as $E(R^2)$ and $E(R^4)$ are known.
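
For a gamma-distributed $R^2$ with shape k, $E(R^4)/[E(R^2)]^2 = (k+1)/k$, so Theorem 1 gives $\beta_{2,p} = p^2(k+1)/k$ whatever the scale. The sketch below, with hypothetical function names, evaluates that expression and compares it with Mardia's sample coefficient on simulated spherical data.

```python
import numpy as np

def kurtosis_from_gamma_shape(p, shape):
    """beta_{2,p} = p^2 E(R^4) / [E(R^2)]^2 when R^2 ~ Gamma(shape, any scale)."""
    return p**2 * (shape + 1.0) / shape

def mardia_kurtosis(X):
    """Sample coefficient b_{2,p}, with the covariance matrix divided by n."""
    n, p = X.shape
    D = X - X.mean(axis=0)
    S = D.T @ D / n
    d2 = np.einsum("ij,ij->i", D @ np.linalg.inv(S), D)   # squared distances
    return float(np.mean(d2**2))

def spherical_sample(n, p, shape, scale, rng):
    """X = R U with R^2 ~ Gamma(shape, scale) and U uniform on the sphere."""
    Z = rng.standard_normal((n, p))
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    R = np.sqrt(rng.gamma(shape, scale, size=n))
    return R[:, None] * U

p = 5
print(kurtosis_from_gamma_shape(p, p / 2))   # normal case: p(p + 2)
rng = np.random.default_rng(0)
X = spherical_sample(200_000, p, p / 4, 4.0, rng)
print(kurtosis_from_gamma_shape(p, p / 4), mardia_kurtosis(X))
```

With shape p/2 the formula returns p(p + 2), the multivariate normal value; smaller shapes give heavier tails and larger kurtosis.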

3  Example  of  Application 

Skewness and kurtosis coefficients are commonly used
as tests for both univariate and multivariate normality.
For instance, Mardia (1970) describes the use of $b_{1,p}$ and
$b_{2,p}$ as tests for multivariate normality. However, for both
the univariate and multivariate cases, there is also a
widespread presumption that these tests are "diagnostic"
in the sense that they indicate the nature of departure
from normality. Put another way, this amounts to a belief
that the skewness tests used to test the normality
hypothesis possess an additional valid interpretation as
tests of a non-skewness hypothesis. For instance, Mardia
(1970) states: "To test $\beta_{1,p} = 0$ for large samples, we
calculate A [a test statistic based on $b_{1,p}$] and reject the
hypothesis for large values of A" (p. 523).

Empirical results, however, do not support the
presumption that skewness tests are diagnostic. For
instance, here we present Monte Carlo results showing
skewness test powers against spherically-contoured
distributions, all of which are non-skewed. We use six
spherically-contoured distributions generated as $X = RU$.
These are defined by the univariate distributions placed
on R:

SC1: $R \sim [\Gamma(8p, 1/8)]^{1/2}$, yielding $\beta_{2,p} = p(p + 1/8) < \beta_{2,p}$(MVN);

SC2: $R \sim [\Gamma(4p, 1/4)]^{1/2}$, yielding $\beta_{2,p} = p(p + 1/4) < \beta_{2,p}$(MVN);

SC3: $R \sim [\Gamma(2p, 1/2)]^{1/2}$, yielding $\beta_{2,p} = p(p + 1/2) < \beta_{2,p}$(MVN);

SC4: $R \sim [\Gamma(p/2, 2)]^{1/2}$, yielding $\beta_{2,p} = p(p + 2) = \beta_{2,p}$(MVN);

SC5: $R \sim [\Gamma(p/4, 4)]^{1/2}$, yielding $\beta_{2,p} = p(p + 4) > \beta_{2,p}$(MVN);

SC6: $R \sim [\Gamma(p/8, 8)]^{1/2}$, yielding $\beta_{2,p} = p(p + 8) > \beta_{2,p}$(MVN).

Note  that  distribution  SC4  is  multivariate  normal. 

In this study, we used levels of
n=25, 50, 100;
p=2, 5, 10; and
$\alpha$=0.05, 0.10.

As an example of results, Table 1 reports results only
for p=5, $\alpha$=0.10. Results for other values of p and $\alpha$=.05
are not qualitatively different. Our power results are all
based on 1,000 replications.

The table shows results for four skewness-based tests.
The rows referenced as $b_{1,p}$(a) are powers of Mardia's
$b_{1,p}$, with critical values obtained from an (asymptotic)
approximate null distribution suggested by Mardia (1970).
The $b_{1,p}$(e) rows are results for $b_{1,p}$ with critical values
derived empirically from 10,000 multivariate normal
distributions. $Q_1$ is a skewness test suggested by Small
(1980), while $b_{1p}$ is a skewness test suggested by
Srivastava (1984). Critical values for $Q_1$ and $b_{1p}$ were
obtained from the asymptotic distributions suggested by
those authors. Conceptually, $Q_1$ tests for skewness in any
of the p marginal distributions, while $b_{1p}$ tests for
skewness in any of the p principal components. These
are, therefore, both more specific and less comprehensive
skewness tests than are tests based on $b_{1,p}$.
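
A small-scale version of the $b_{1,p}$(e) experiment can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses the usual form of Mardia's sample skewness, $b_{1,p} = n^{-2}\sum_i\sum_j[(X_i - \bar X)'S^{-1}(X_j - \bar X)]^3$, takes empirical critical values from 2,000 simulated normal samples rather than 10,000, and uses 500 replications per distribution; all function names are hypothetical.

```python
import numpy as np

def mardia_skewness(X):
    """Mardia's sample skewness b_{1,p} (covariance matrix divided by n)."""
    n = X.shape[0]
    D = X - X.mean(axis=0)
    S = D.T @ D / n
    G = D @ np.linalg.solve(S, D.T)   # G[i, j] = (x_i - xbar)' S^{-1} (x_j - xbar)
    return float(np.sum(G**3) / n**2)

def spherical_sample(n, p, shape, rng):
    """X = R U with R^2 ~ Gamma(shape, p / shape), so that E(R^2) = p."""
    Z = rng.standard_normal((n, p))
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    R = np.sqrt(rng.gamma(shape, p / shape, size=n))
    return R[:, None] * U

def rejection_rate(p, n, shape, crit, reps, rng):
    """Share of simulated samples whose b_{1,p} exceeds the critical value."""
    stats = np.array([mardia_skewness(spherical_sample(n, p, shape, rng))
                      for _ in range(reps)])
    return float(np.mean(stats > crit))

rng = np.random.default_rng(0)
p, n, alpha = 5, 50, 0.10
# Empirical critical value under multivariate normality (shape p/2, i.e. SC4)
null = [mardia_skewness(spherical_sample(n, p, p / 2, rng)) for _ in range(2000)]
crit = float(np.quantile(null, 1 - alpha))
power_sc1 = rejection_rate(p, n, 8 * p, crit, 500, rng)   # low kurtosis
power_sc6 = rejection_rate(p, n, p / 8, crit, 500, rng)   # high kurtosis
print(power_sc1, power_sc6)
```

Both alternatives are non-skewed by construction, so any departure of the rejection rate from the nominal size reflects sensitivity to kurtosis alone.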

Results  in  Table  1  suggest  the  following  conclusions: 

1) For distributions with less than normal kurtosis, the
skewness tests' detection levels are deflated well
below test size.

2)  For  distributions  with  greater  than  normal  kurtosis, 
the  skewness  tests’  detection  levels  are  inflated  well 
above  test  size.  This  implies  that  a  "skewness"  test  has 
a  strong  probability  of  misdiagnosing  as  "skewed"  a 
nonskewed  distribution  with  high  kurtosis. 

3)  These  effects,  at  least  the  inflation  of  detection  levels, 
grow  more  pronounced  as  sample  size  increases. 
Thus,  they  do  not  appear  to  be  small  sample 
properties. 

This  effect  is  not  isolated  or  unique  to  our  study. 
Although  the  effect  has  often  been  overlooked,  to  our 
knowledge,  empirical  (Monte  Carlo)  studies  have  been 
unanimous  in  demonstrating  that  typical  "skewness"  tests 
have  detection  levels  strongly  inflated  (deflated)  by 
greater  than  (less  than)  normal  kurtosis.  Furthermore, 
other  studies  also  suggest  this  is  not  a  small  sample 
property,  but  rather  an  effect  that  grows  more 
pronounced with increasing sample size. [See Horswell
and Looney (1991) for a review of relevant Monte Carlo
studies and additional Monte Carlo results of ours along
these lines.]

4  Discussion 

The poor "diagnostic" properties of skewness-based
tests stem from the fact that the tests use normality-based
null distributions. Normal distributions are not skewed.
However, the sampling distributions of skewness
coefficients (or skewness test statistics) differ greatly over
non-skewed distributions. For instance, the sampling
distributions of $b_{1,p}$ differ greatly across the six
spherically-contoured distributions used here. If, for a
given n and p, the sampling distribution of, say, $b_{1,p}$ was
approximately or asymptotically the same across
non-skewed distributions, then a normality-based null
distribution of $b_{1,p}$ might be generally useful to test
hypotheses of non-skewness. However, this is not the
case. A more extensive discussion of this problem
appears in Horswell and Looney (1991).

Table 1: POWERS OF SKEWNESS TESTS
AGAINST SPHERICALLY-CONTOURED
DISTRIBUTIONS p=5

Nominal test size = 0.10

Table entries are per cent of distributions rejected

                        n=25    n=50    n=100
SC1
    b_{1,p}(a)             0       0       0
    b_{1,p}(e)             0       0       0
    Q_1                    0       0       0
    b_{1p}                 1       0       0
SC2
    b_{1,p}(a)             0       0       0
    b_{1,p}(e)             0       0       0
    Q_1                    0       0       0
    b_{1p}                 0       0       0
SC3
    b_{1,p}(a)             0       0       0
    b_{1,p}(e)             0       0       0
    Q_1                    1       1       0
    b_{1p}                 1       1       0
SC4 = multivariate normal
    b_{1,p}(a)             4       7       8
    b_{1,p}(e)            11      10      10
    Q_1                   10      12       9
    b_{1p}                 6       8      10
SC5
    b_{1,p}(a)            37      63      81
    b_{1,p}(e)            58      69      83
    Q_1                   32      42      48
    b_{1p}                22      35      42
SC6
    b_{1,p}(a)            88      99     100
    b_{1,p}(e)            94     100     100
    Q_1                   72      81      86
    b_{1p}                56      71      86
Abstract

The  problem  of  inference  for  the  mean  of  a  highly 
asymmetric  distribution  is  considered.  Even  with  large 
sample  sizes,  usual  asymptotics  (i.e.,  normal  theory)  give 
poor  answers,  and  standard  modifications,  such  as  higher 
moment  correction  factors,  provide  little  help.  We  attempt 
to  develop  diagnostics  to  indicate  when  inferences  are  likely 
to  be  valid,  and  we  examine  the  performance  of  several 
modifications  to  the  standard  procedure.  The  problem  is 
illustrated  with  data  from  particle  physics. 

1.  Introduction 

When interested in measures of central tendency,
robust estimates of location such as the median or the
trimmed mean are commonly recommended if data are
asymmetric. Occasionally, however, the estimates of the
population mean or total are needed. We consider an
example from particle physics, in particular from the use of
Monte Carlo to simulate neutron transport. As the resulting
distributions of these complex processes are rarely known,
simulation is used to determine the values of physical
quantities, such as average flux passing through a region. Thus
the mean value of such a distribution is truly of interest. In
this paper, we examine performance of the standard
normal-theory based estimator in developing confidence
intervals for the mean. We will consider modifications of
the standard procedure, as well as compute theoretical
moments for the appropriate reference distributions.

2. Nonparametric Confidence Intervals

Efron (1988) has characterized the problem of
nonparametric confidence intervals for the mean as follows: "In
one sense this problem is impossible, since modifying F
with a tiny probability of X being enormous ... can totally
change $\mu$ without ever showing up in most samples ... On
the other hand, the problem is 'solved' every day by using
the standard Student t-intervals ..." We will examine
modifications to these procedures in the "naive" case in which we
assume nothing about the data other than that it is
non-negative and independent and identically distributed (i.i.d.).

To generate positively skewed data, we use the absolute
value of a Cauchy random variable. It has the density

$f(x) = \frac{2}{\pi}\,\frac{1}{1 + x^2}$,  x > 0.

If we censor at a threshold value T, the resulting random
variable will have all moments finite but arbitrarily large.
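
The mean of the censored variable has a closed form that is easy to evaluate. The sketch below assumes that "censoring at T" means recording any value above T as T itself, and the function names are hypothetical; under that reading, T = 10000 gives a mean close to 6.5.

```python
import numpy as np

def abs_cauchy_cdf(x):
    """CDF of |Cauchy|: F(x) = (2 / pi) arctan(x) for x >= 0."""
    return 2.0 * np.arctan(x) / np.pi

def censored_mean(T):
    """E[min(X, T)] = (1 / pi) log(1 + T^2) + T * P(X > T)."""
    tail = 1.0 - abs_cauchy_cdf(T)
    return np.log1p(T**2) / np.pi + T * tail

mean_T = censored_mean(10_000.0)
print(round(float(mean_T), 3))
```

The logarithmic growth in T is why the mean stays modest even though a single censored observation can be thousands of times larger.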

The standard nonparametric approach to confidence
intervals gives intervals with nominal coverage rate $1 - \alpha$ of
the form

$\bar x \pm t_{n-1,(1-\alpha/2)}\, s\, n^{-1/2}$,

where $\bar x$ and s are the sample mean and sample standard
deviation, n is the sample size, and where $t_{n-1,(1-\alpha/2)}$ is
the appropriate percentage point from the t-distribution with
n-1 degrees of freedom. When the approximation to the
t-distribution is poor the performance of these intervals is
degraded. In the case where the underlying random variables
are positively skewed, the estimates of mean and variance
will be biased low and correlated. This results in intervals
that often miss the true mean on the low side. Figures 1
and 2 illustrate this. Figure 1 is a plot of 1000 pairs of
$(\bar x, s)$, each pair from a sample of size 1000 from a Gaussian
random variable with mean 6.5. The envelope formed
by the two diagonal lines contains the pairs which result in
confidence intervals which cover 6.5; about 95% of the
points lie within this envelope, as expected. Figure 2 shows
1000 pairs of the same statistics for samples of size 1000
generated from the absolute value Cauchy distribution,
censored at T = 10000 (with mean 6.5). Note first that each axis
is in log scale, resulting in curved envelope lines. The
distribution of $\bar x$ and s is no longer elliptical, and significant
biases and correlations are present. In fact, only about 60%
of the points result in intervals which cover 6.5. In most
cases 1000 is considered a large sample size but here it is
clear that the t-approximation is a bad one, and that n is too
small.
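
The coverage comparison can be reproduced approximately by simulation. The sketch below is illustrative only: it assumes that censoring replaces values above T by T, uses 1.962 as an approximate 97.5% point of the t-distribution with 999 degrees of freedom, gives the Gaussian comparison a standard deviation of 1 (the text does not state one), and runs 2000 replications; the function names are hypothetical.

```python
import numpy as np

T = 10_000.0
# Mean of |Cauchy| censored at T, assuming values above T are recorded as T
TRUE_MEAN = np.log1p(T**2) / np.pi + T * (1.0 - 2.0 * np.arctan(T) / np.pi)

def t_interval_coverage(samples, true_mean, tcrit=1.962):
    """Share of rows whose interval xbar +/- tcrit * s / sqrt(n) covers true_mean."""
    n = samples.shape[1]
    xbar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)
    half = tcrit * s / np.sqrt(n)
    return float(np.mean((xbar - half <= true_mean) & (true_mean <= xbar + half)))

rng = np.random.default_rng(0)
reps, n = 2000, 1000
skewed = np.minimum(np.abs(rng.standard_cauchy((reps, n))), T)
gauss = rng.normal(6.5, 1.0, size=(reps, n))
cov_skewed = t_interval_coverage(skewed, TRUE_MEAN)
cov_gauss = t_interval_coverage(gauss, 6.5)
print(cov_skewed, cov_gauss)
```

The Gaussian samples should cover at roughly the nominal 95% rate, while the censored Cauchy samples fall far short of it at the same sample size.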




Modifications to the standard procedure often try to
better characterize the distribution of $t = \sqrt{n}(\bar x - \mu)/s$ (e.g.,
Johnson 1978). Following Hall (1983), an Edgeworth
expansion of the distribution of t can be inverted to obtain a
modified confidence interval. The modifications involve
3rd- and higher central moments of the underlying distribution;
in the examples considered here, only the first
modification (using 3rd-moment terms) improved coverage
rates. That modification gives

$\bar x + \frac{\hat\mu_3}{6 s^2 n}\,(1 + 2t^2_{n-1,(1-\alpha/2)}) \pm t_{n-1,(1-\alpha/2)}\, s\, n^{-1/2}$,

where $\hat\mu_3$ is the sample 3rd central moment. This interval is
the same length as before but is now asymmetric about $\bar x$,
biased in the direction suggested by the sample skewness.

Table 1, reprinted from Pederson (1991), contains
observed coverage rates for the mean of an absolute value
Cauchy random variable, censored at 10000. For each of the
several sample sizes considered, 800 independent replications
were generated. Both the standard (denoted by std)
and 3rd-moment-corrected (denoted by 3rd) intervals were
computed.

Table 1

Observed Coverage Rates
(nominal level = 0.95)

       n     Standard   3rd-moment
    1000       0.59        0.66
    5000       0.74        0.79
   10000       0.82        0.84
   20000       0.87        0.92
   50000       0.93        0.94
  100000       0.93        0.94

Figure 1. Sample moments, Gaussian r.v.

The standard intervals cover at below the nominal
rate for sample sizes less than 50000, and the 3rd moment
corrections are modest at best. The intervals that miss are
almost always too low, as expected. These results suggest
that standard confidence interval procedures are inappropriate
in this problem for samples under 20000 in size, as there
is considerable undersampling of the tail regions.

3. Diagnostics

If standard modifications are limited by the sample
size and skewness of the problem, one would like to know
when such a situation exists. Geary (1947) computed the
first four semi-invariants of t, from which moments can be
obtained, expressed in terms of the first four moments of the
underlying distribution. Pederson (1991) found that the critical
quantity in determining the convergence of t to a standard
Gaussian random variable is $\gamma$, the squared coefficient
of variation of $s^2$. To first order, the variance of t is
1 plus a term involving $\gamma$ and $\rho^2$, where $\rho$ is the correlation
between $\bar x$ and $s^2$ and is usually near 1 in the cases
considered here. Thus when $\gamma$ is small relative to 1, the variance
of t will be near that of a standard normal. For the simulations
from Table 1, $\gamma$ for n = 20000 is 0.26, and for n = 50000 is 0.11;
by the time $\gamma$ has reached 0.1, coverages are near nominal levels.
Unfortunately, an estimate of $\gamma$ based on sample moments does
not appear to be a useful diagnostic, because it is biased
low. Work on developing useful diagnostics continues.
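
The diagnostic quantity $\gamma$ discussed in Section 3, the squared coefficient of variation of $s^2$, can be approximated from the theoretical moments of the censored absolute Cauchy distribution. The sketch below rests on two assumptions that are not taken from the paper: censoring replaces values above T by T, and $\mathrm{Var}(s^2) \approx (\mu_4 - \sigma^4)/n$ for large n, so that $\gamma \approx (\mu_4/\sigma^4 - 1)/n$. The function names are hypothetical.

```python
import numpy as np

def censored_raw_moments(T):
    """First four raw moments of min(|Cauchy|, T)."""
    a = np.arctan(T)
    c = 2.0 / np.pi
    tail = 1.0 - c * a                      # P(X > T)
    log_term = 0.5 * np.log1p(T**2)
    m1 = c * log_term + T * tail
    m2 = c * (T - a) + T**2 * tail
    m3 = c * (T**2 / 2.0 - log_term) + T**3 * tail
    m4 = c * (T**3 / 3.0 - T + a) + T**4 * tail
    return m1, m2, m3, m4

def gamma_diagnostic(n, T=10_000.0):
    """Large-sample approximation to the squared CV of s^2."""
    m1, m2, m3, m4 = censored_raw_moments(T)
    var = m2 - m1**2
    mu4 = m4 - 4.0 * m1 * m3 + 6.0 * m1**2 * m2 - 3.0 * m1**4
    return (mu4 / var**2 - 1.0) / n

for n in (1000, 20_000, 50_000):
    print(n, round(float(gamma_diagnostic(n)), 3))
```

Under these assumptions the approximation gives values near 0.26 at n = 20000 and near 0.11 at n = 50000, in line with the figures quoted in the text.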
ABSTRACT

We  consider  estimating  the  mean  of  a  positively  skewed 
distribution.  It  has  been  noted  that  in  random  samples  the 
sample  mean  has  a  large  probability  of  falling  below  the 
mean  of  the  distribution,  because  of  such  skewness.  Various 
ad  hoc  procedures  have  been  proposed  to  correct  this  low 
coverage  of  the  mean  in  order  to  estimate  conservatively  long¬ 
term  exposure  to  contaminated  soils  at  toxic  waste  sites.  We 
propose  a  direct  estimate  of  the  mean  based  on  a  penalized 
empirical  loss  function.  This  loss  function  is  made  up  of  a 
squared  error  loss  plus  a  penalty  for  each  observation  that  falls 
above  the  estimate.  The  resulting  minimum  risk  estimate, 
called  the  penalized  mean,  is  derived  iteratively  and  shown  to 
be  biased  in  favor  of  greater  coverage. 

We show that, asymptotically, a one-step iterate of the
penalized mean is unbiased, converges almost surely to the
true mean, and with mild assumptions on the form of the
penalty, is normally distributed. Based on a penalized loss, we
show that this new estimator is uniformly better than
the sample mean when sample size is large. The simulation
results show that if we choose the penalty constant properly,
the new estimator has the same coverage as an upper
confidence limit estimator that has been proposed but with less
variance and bias.

1.  INTRODUCTION 

The normal distribution has long been the standard model
for the development of statistical theory. Its structure,
properties, and centrality in asymptotic theory allow for the
construction of an elegant and concise theory of estimation.
But experiments and data collection often result in
measurements that are inconsistent with an assumption of normality.
Statisticians often encounter samples for which a few very
large or outlying measurements are included. Many ad hoc
procedures have been developed for the practical handling
of such outliers, and a debate has often ensued on "whether,
and on what basis, we should discard observations from a set
of data on the grounds that they are 'unrepresentative',
'spurious' or 'mavericks', or 'rogues'," Barnett and Lewis
(1978).

More modern views and analysis have taken a different
approach. Rather than modifying the data to fit the prescribed
assumptions of the normal theory, much work in the past
twenty years has gone into the development of estimation
procedures and statistical tests that are either resistant to the
effects of these outliers or robust, that is, relatively
unaffected by the lack of adherence to underlying
assumptions. These techniques are "becoming a core
component of statistical practice," Hoaglin et al. (1985).

Tukey (1962) proposed that outliers could be explained
through the use of "longer-tailed" distributions as underlying
models. Numerous robust estimators of location in symmetric
distributions have been proposed based on unequal weighting.
In Huber (1981) and Hampel et al. (1986) trimmed means,
M-estimators, L-estimators, and reweighted estimators have
been developed using theoretical and empirical approaches.

Fuller (1970, 1991) has investigated simple estimators for
the mean of a skewed population using a technique suggested
by Charles Winsor and studied in Tukey and
McLaughlin (1963) and Dixon and Tukey (1968). In this
technique the largest k observations are replaced by the
(k+1)st largest observation and similarly for the smallest
observations. The mean of the resulting sample was called by
Tukey a "Winsorized" mean. Fuller studied this estimator
assuming that the right tail of the distribution function could
be well approximated by the tail of a Weibull distribution.

A problem arises when using a sample mean to estimate the
mean of long-term exposure to contaminated soils at toxic
waste sites under the EPA's Superfund program. Since the
underlying distribution is positively skewed, the sample mean
has a large probability of falling below the population mean.
This results in a consistent under-estimation of the population
mean. The EPA Office of Emergency and Remedial Response
(OERR) convened a Workshop discussion on February 23,
1990, to examine methods for solving this under-estimation
problem. In the workshop, many approaches were
proposed such as stratifying the data or interpolating the data
using kriging, polygon methods or triangle methods.
Advantages and disadvantages of these methods were widely
discussed in the Workshop, but no final conclusion was
reached. The under-estimation problem was left open. An upper
confidence limit estimator based on normal theory, that is,
UCL = $\bar X$ + 1.96 sdv($\bar X$), was temporarily put into use.

In order to correct this under-estimation problem, in this
paper, we propose a direct estimate of the mean based on a
squared error loss plus a penalty for each observation that
falls above the estimate. In attempting to minimize the
average of such penalized losses we derive a new estimator of
the mean of a positively skewed population with adequate
coverage, where the coverage of an estimator is defined as the
probability that the estimator is greater than the estimated
parameter, that is, $P(\hat\theta > \theta)$.

2. A New Criterion - Penalized Loss

We define a new criterion, that is, penalized loss, as follows:

$L(d, \theta) = (d - \theta)^2 + \lambda 1_{(d < \theta)}$,  (2.1)

where $\lambda > 0$, $\lambda = o(1)$, and $\theta$ denotes the true mean in the
population.

The first term of the loss function is the squared error loss.
We define the second term as a lack of coverage loss. We
define $\lambda$ to be the penalty constant. We penalize the estimate if
it is less than $\theta$.

Let  T(X)  be  the  eslimator  of  0.  Then  the  ri.sk  of  this 
e.stimation,  i.e.  the  average  lo.s.s,  is 

R(e)  =  E(  T(X)  -  0  P(  T(X)  <  0  ).  (2.2) 

This risk function consists of two terms. One is the mean
square error term and the other is the penalty term. Minimizing
the mean square error pulls the estimator towards $\theta$.
Minimizing the penalty term pulls the estimator above $\theta$.
Hence, this risk is a kind of balance of the variance, bias and
coverage. Minimizing this risk is difficult for the nonparametric
problem considered here. We do not know the form of the
underlying distribution. We only know that the underlying
distribution is positively skewed.

In order to find an estimate with small risk under penalized
loss, we define a penalized empirical loss function based
directly on the sample.

3. A Penalized Empirical Loss Function

The risk defined in equation (2.2) includes a term for a
squared error loss as well as a penalty term for lack of coverage.
A sample-based approach to measuring similar ideas can be
developed from an empirical loss, like that seen in maximum
likelihood or M-estimation.

Define an empirical loss function based on the sample as
follows:

    L^*(x_i, t) = (x_i - t)^2 + 2\lambda P(t < X < x_i).    (3.1)

Here, the penalty we impose is proportional to the probability
that $X$ falls between the estimate, $t$, and the observation.
The loss defined in (2.1) imposes a penalty if the estimate falls
below the parameter $\theta$. A sample-based approach to this
penalty would impose the penalty whenever an observation
falls above the estimate, $t$.

Based on the form of the penalty term in (3.1), the more
extreme an observation, the greater the penalty it imposes on
our estimate. To minimize the penalty, the estimator will tend
to be larger, and thus have a small probability of falling below
$\theta$. With this sample-based loss, the penalty is based on
the underlying distribution of the population and not on the
distribution of the unknown estimator. The constant 2 used
in the penalty term is only for convenience of calculation.

The empirical average loss, i.e. the empirical risk, is

    R(t) = \frac{1}{n} \sum_{i=1}^{n} L^*(x_i, t)
         = \frac{1}{n} \sum_{i=1}^{n} (x_i - t)^2
           + \frac{2\lambda}{n} \sum_{i=1}^{n} P(t < X < x_i).    (3.2)

We hope to find an estimator of $\theta$ based on minimizing the
empirical risk (3.2) such that it has small risk under the
penalized loss (2.1) but has adequate coverage.

We call the empirical risk in (3.2) a penalized empirical
risk.

4.  Penalized  Mean  and  Its  Large  Sample  Properties 

In order to find the minimum empirical risk estimator of
the mean, we have proved several properties of this penalized
empirical risk, as follows:

(1) $R(t)$ is a continuous function of $t$.

(2) $R(t)$ is piecewise differentiable. The derivative does not
exist at $t = x_i$, but $R'(x_i^-)$ and $R'(x_i^+)$ have the same
sign, $i = 1, \ldots, n$.

(3) If $|f'(t)| \le M$, then the minimum value of $R(t)$
exists and is unique.

Based on these results we have the following theorem.

Theorem 4.1. Suppose $|f'(t)| < M$. Then the empirical risk
$R(t)$ is minimized by the solution to the equation

    t = \bar{x} + \lambda f(t) (1 - F_n(t)),    (4.1)

if it exists.

The solution of equation (4.1) is not the minimum risk
estimator of $\theta$, since equation (4.1) includes the unknown
pdf of the population. In order to find the estimator we are
looking for, we substitute a density estimator $\hat{f}(t)$ into
equation (4.1), and define a new estimator of the mean as follows.

Definition 4.1. If the $X_i$'s are i.i.d. from a positively skewed
distribution, then

    \hat{\theta} = \bar{X} + \lambda \hat{f}(\hat{\theta}) (1 - F_n(\hat{\theta}))    (4.2)

is called the penalized mean, where $\hat{f}(\cdot)$ is a density
estimator of $X$, $F_n(\cdot)$ is the empirical distribution
function of $X$ and $\lambda$ is the penalty constant.

Since equation (4.2) defining $\hat{\theta}$ includes an estimate
of the underlying density function, it may appear that this density
estimate would provide adequate information to estimate $\theta$,
but this is not so. We need an estimate of the density function
only at one point, which we approximate by $\hat{f}(\hat{\theta})$.

The penalized mean is defined recursively, so it can be
obtained iteratively. The recursive algorithm is the following.

Given an initial estimate $\hat{\theta}_0$,

    \hat{\theta}_{k+1} = \bar{X} + \lambda \hat{f}(\hat{\theta}_k) (1 - F_n(\hat{\theta}_k)),  k \ge 0.    (4.3)
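The recursion (4.3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian kernel, the rule-of-thumb bandwidth, and the function names `gaussian_kde`, `ecdf` and `penalized_mean` are choices made for the example and are not specified in the paper. The starting point is taken to be the sample mean, so the second element of the returned path is the value after a single step.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Return t -> kernel density estimate at t (Gaussian kernel, an assumption)."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def f_hat(t):
        return norm * sum(math.exp(-0.5 * ((t - x) / bandwidth) ** 2) for x in sample)

    return f_hat


def ecdf(sample):
    """Return the empirical distribution function F_n."""
    n = len(sample)

    def F_n(t):
        return sum(1 for x in sample if x <= t) / n

    return F_n


def penalized_mean(sample, lam, steps=50, tol=1e-10):
    """Iterate theta_{k+1} = xbar + lam * f_hat(theta_k) * (1 - F_n(theta_k))."""
    n = len(sample)
    xbar = sum(sample) / n
    sd = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    h = 1.06 * sd * n ** (-0.2)  # rule-of-thumb bandwidth (an assumption)
    f_hat, F_n = gaussian_kde(sample, h), ecdf(sample)
    theta = xbar  # starting point theta_0 = sample mean
    path = [theta]
    for _ in range(steps):
        new = xbar + lam * f_hat(theta) * (1.0 - F_n(theta))
        path.append(new)
        if abs(new - theta) < tol:
            break
        theta = new
    return path


random.seed(0)
data = [random.lognormvariate(0.0, 1.0) for _ in range(40)]
path = penalized_mean(data, lam=0.5)  # lam = 0.5 is an arbitrary illustrative value
print("sample mean:", path[0], "after one step:", path[1], "last iterate:", path[-1])
```

Because the penalty term is non-negative, every iterate in this sketch lies at or above the sample mean.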

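It is not difficult to prove the following theorem.

Theorem 4.2. Let $\hat{f}(x)$ be a density estimator of $f(x)$.
Assume that

(i) $\|\hat{f} - f\| \to 0$, as $n \to \infty$;

(ii) $\lambda \to 0$, as $n \to \infty$,

and let $\hat{\theta}_n$ be the solution of (4.2). Then we have

(1) $\hat{\theta}_n$ is an asymptotically unbiased estimator of $\theta$;

(2) assuming $\lambda = O(n^{-1/2})$, we have $E(\hat{\theta}_n - \theta)^2 \le O(n^{-1})$;

(3) $\hat{\theta}_n \xrightarrow{P} \theta$;

(4) $\hat{\theta}_n \xrightarrow{a.s.} \theta$;

(5) assuming $\lambda = o(n^{-1/2})$, then $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2)$.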

5.  One-Step  Iterate  of  the  Penalized  Mean 

Recall the definition of the penalized mean. It defines the
estimator in terms of itself. Therefore, the penalized mean
forms the basis for a recursive algorithm. If we choose $\bar{x}$
as a starting point, the simulation results show that most of the
improvement of the estimate comes after one step. We define the
one-step iterate of the penalized mean as

    \tilde{\theta} = \bar{X} + \lambda \hat{f}(\bar{X}) (1 - F_n(\bar{X})).    (5.1)

Under the same assumptions given in Theorem 4.2 and by
using the same approach as we used for the penalized mean,
we have

(1) $\tilde{\theta}$ is an asymptotically unbiased estimator of $\theta$;

(2) $\tilde{\theta}$ converges to $\theta$ almost surely;

(3) if we assume $\lambda = o(n^{-1/2})$, $\tilde{\theta}$ is
asymptotically normally distributed with mean $\theta$ and
variance $\sigma^2$;

(4) $\tilde{\theta}$ has better coverage than $\bar{X}$.

Comparing with this estimator, we have also proved that $\bar{X}$
is inadmissible in a class of nonnegatively skewed distributions
under the penalized loss in (2.1).

6.  Simulation  Results. 

We have found a point estimator of the population mean
for positively skewed distributions, the one-step iterate of
the penalized mean, under the penalized loss. But choosing the
penalty constant is still a knotty problem. In point estimation
problems, one of the traditional ways to evaluate the
goodness of a point estimator is to check the unbiasedness of the
estimator. Now the problem we face is how to evaluate the
goodness of a biased estimator with adequate coverage.
Instead of constructing a frequency distribution of the
estimates obtained in repeated sampling and noting how closely
the distribution centers about the population mean, we suggest
checking the proportion of the estimates obtained in repeated
sampling that fall in a given error range. We evaluate the
goodness of our estimators based on the following four
different coverages.


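    p_0 = P(\hat{\theta} > \theta),    (6.1)

    p_1 = P(-1/n < (\hat{\theta} - \theta)/\mathrm{sdv}(\bar{X}) < 1),    (6.2)

    p_2 = P(-1/n < (\hat{\theta} - \theta)/\mathrm{sdv}(\bar{X}) < 2),    (6.3)

    p_3 = P(-1/n < (\hat{\theta} - \theta)/\mathrm{sdv}(\bar{X}) < 3).    (6.4)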

Note that p0 is the same as the coverage defined in the
introduction, and pi, i = 1, 2, 3, are the probabilities of the
estimator falling within i standard deviations of $\bar{X}$ above
the true mean, with an error of the order of 1/n allowed on the
left side of the mean.
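The coverage p0 can be estimated by repeated sampling. The sketch below is illustrative only: it compares the sample mean with the normal-theory UCL on lognormal(0,1) samples, and it assumes that sdv of the sample mean is estimated by s divided by the square root of n. The function names, sample size, number of repetitions and seed are hypothetical choices, not those of the paper.

```python
import math
import random


def mean(sample):
    return sum(sample) / len(sample)


def ucl(sample):
    """Normal-theory upper limit: xbar + 1.96 * (estimated sdv of xbar)."""
    n = len(sample)
    xbar = mean(sample)
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    return xbar + 1.96 * math.sqrt(s2 / n)  # sdv(xbar) taken as s / sqrt(n) (assumption)


def coverage_p0(estimator, theta, n, reps, rng):
    """Fraction of repeated samples in which estimator(sample) > theta."""
    hits = 0
    for _ in range(reps):
        sample = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        if estimator(sample) > theta:
            hits += 1
    return hits / reps


theta = math.exp(0.5)  # true mean of the lognormal(0, 1) distribution
# The same seed gives both estimators the same sequence of samples.
p0_mean = coverage_p0(mean, theta, n=50, reps=1000, rng=random.Random(1))
p0_ucl = coverage_p0(ucl, theta, n=50, reps=1000, rng=random.Random(1))
print("p0 of sample mean:", p0_mean, " p0 of UCL:", p0_ucl)
```

Since the UCL is never smaller than the sample mean on the same data, its estimated p0 is at least as large in this sketch.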

In order to slow down the convergence of the penalty term,
we choose

    \lambda = \frac{c}{\sqrt{n} \, \log\log(n)}    (6.5)

in (5.1), and choose c such that our estimator has the same p0
as UCL.

Here we only show the simulation results for the lognormal(0,1)
case. Similar results hold for other common positively skewed
distributions.

Table 6.1 shows that the pmean has better coverages than $\bar{X}$,
and has almost the same p0 coverage as UCL. The p1 and p2
coverages of pmean are much greater than those of UCL.
Pmean also has a smaller mean square error than UCL. Note that
coverages p0 and p3 of pmean are almost the same in all cases.
This means there is no long tail on the right side of the
distribution of pmean. Obviously, the pmean estimator is
superior to the UCL estimator. But the optimal penalty
constant c is still under investigation.


Table 6.1

Lognormal (0,1) Case
(100 repetitions)

n = (illegible)
est.       p.c.    MSE      P0       P1       P2       P3
\bar{X}            .1845    .4000    .3200    .4200    .4300
UCL                1.099    .8600    .2800    .5700    .6900
pmean      20      .2424    .8600    .5300    .8000    .8700

n = (illegible)
est.       p.c.    MSE      P0       P1       P2       P3
\bar{X}            .1123    .4100    .3100    .4100    .4200
UCL                .5235    .8300    .2500    .5200    .6700
pmean      25      .1601    .8300    .4800    .7800    .8400

n = 50
est.       p.c.    MSE      P0       P1       P2       P3
\bar{X}            .0930    .4300    .3100    .4000    .4200
UCL                .5446    .9200    .3100    .5500    .8200
pmean      36      .1497    .9100    .4900    .8600    .9200

NOTE: (1) p.c. is the penalty constant c in (6.5);
      (2) pmean is the one-step iterate of the penalized mean defined in (5.1);
      (3) UCL = \bar{X} + 1.96 sdv(\bar{X}).


Abstract

Diagnostic classifications based on the estimation of deviations
from the norm rely on homomorphic features between images and
pathology.

The measurement is the three-dimensional mapping of a tracer
distribution, which reflects regional myocardial blood flow.
This measured distribution or image is compared to a normal
distribution, derived from measurements in subjects from a
population in which the myocardial perfusion is assumed to be
normal. In addition to the normal (average) distribution, a
measure of natural variation is made.

The comparison between a test case and the normal population
distribution leads to a measure of deviation from the norm. The
degree of deviation is quantitatively diagnostic only if larger
deviations indicate either more advanced disease or a higher
probability of disease.
KEY WORDS: Quantitative classification; homomorphism.

1. INTRODUCTION

In the classical approaches, medical images are interpreted
visually. Abnormalities are detected by comparing the findings
with a virtual image of normal cases. This comparison, to be
effective, must take normal variation into account.

In great part, normal variation is due to biological variation
in size and shape of normal subjects.


Formal quantification in Nuclear Medicine has usually been
restricted to dynamic parameters extracted from dynamic images.
In this approach, the image is used merely to define sampling
regions, mostly on the basis of pattern recognition (e.g. the
generation of a time activity curve from a region of interest
drawn over the renal region). On the other hand, quantification
of spatial distributions has been hampered by the inability
to spatially match untransformed images.

2. MATERIALS AND METHODS

The data are tomographic images obtained after the intravenous
injection of thallium chloride at the end of a stress test.
Since thallium is a diffusible intracellular tracer (a potassium
analog), for a short but appreciable time following the
injection, its distribution in tissues is proportional to the
relative distribution of cardiac output in those tissues (1).

The image is a three-dimensional mapping of the spatial
distribution of the tracer in the myocardium. The sampled value
is the maximum pixel value found across the myocardial region.

To minimize biological variation due to size and shape, and thus
to be able to define corresponding points, the images undergo a
polar transform as follows. The origin of the polar transform is
located in the cavity of the left ventricle. The latitude in
the volume image is represented by the angle P, which has its
origin at six o'clock (the south pole, or apex, in the reoriented
image), and is mapped in the vector as
r = SQRT[(x-32)**2 + (y-32)**2], where P = r*64/135.

The longitude is represented by the angle Th, which has its
origin at 9 o'clock in the volume image, or east (septal),
and maps in the vector as itself, again with the origin at 9
o'clock.

The sample value found in the volume image along the radius
(P,Th) is located in the vector at x = r*cos(Th) + 32,
y = r*sin(Th) + 32.

Radii are sampled for 0<P<135 and 0<Th<360. This sampling
follows a registration of the image such that the most proximal
part of the base of the heart lies along the radius at P = 135
for any value of Th, and such that the radius with angle P = 0
goes through the apex and represents the long axis of the heart.

If one wanted to map each radial sampling value one-to-one in the
vector, one would be allowed only 32 radii for P (over 135
degrees), and a variable number of discrete values of Th, with
the maximum being 256 at P = 135 and the minimum being 4 at
P = 135/64. Since this would lead to undersampling in the volume
image, the mapping has to be many-to-one.

This requires a special strategy. Classically, the sampling
value is the maximum value along the radius. Many-to-one mapping
needs special accommodation. The initial resident value at (x,y)
is -1. If the vector (P,Th) which maps in (x,y) has a sampling
value of A < the resident value at (x,y) AND >= 0, then A
is substituted for the resident value at (x,y).

The vector A(x,y) allows inter-patient comparison, since it
contains only directionality, but no information on size, and
minimal information on shape.

It follows that one can construct an average vector Av(x,y)
and a standard deviation S(x,y) from patients in the control
group (2).

Comparison of a test case with the normative vectors consists of
computing

D(x,y) = (Av(x,y)-A(x,y))/S(x,y)

for each point in the vector A, and a sum (SS) of D(x,y) which
represents the global deviation from the normal in the test
case.
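As a minimal sketch of this comparison, the following Python computes D(x,y) and its sum SS for small nested lists. The 2 x 2 arrays and the function names are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch assumes S(x,y) > 0 at every point.

```python
def deviation_map(test, avg, sd):
    """D(x, y) = (Av(x, y) - A(x, y)) / S(x, y) at every sampled point."""
    return [
        [(av - a) / s for a, av, s in zip(row_a, row_av, row_s)]
        for row_a, row_av, row_s in zip(test, avg, sd)
    ]


def global_deviation(d_map):
    """SS: the sum of D(x, y) over all points of the vector."""
    return sum(sum(row) for row in d_map)


# Hypothetical values for illustration only.
A = [[80.0, 90.0], [100.0, 60.0]]     # test case
Av = [[100.0, 100.0], [100.0, 100.0]]  # control-group average
S = [[10.0, 10.0], [10.0, 20.0]]       # control-group standard deviation

D = deviation_map(A, Av, S)
SS = global_deviation(D)
print(D, SS)
```

With these illustrative numbers the deviations are 2, 1, 0 and 2, so SS is 5.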

The transformation and sampling result in considerable
information loss. One still needs to prove that SS is a
quantitative diagnostic measure.

One approach is to adapt Bayes' theorem. In general, if Se is the
sensitivity, Sp the specificity, P the prevalence and PP the
positive predictive value, then

PP = Se.P/(Se.P + (1.-P)(1.-Sp)).

But this formulation presumes that the sign or symptom is
either present or absent, and not quantifiable.
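The formula can be checked numerically. The sketch below evaluates PP for the two (Se, Sp) pairs reported in the results of this paper; the prevalence of 0.5 is an assumption made purely for illustration.

```python
def positive_predictive_value(se, sp, p):
    """PP = Se.P / (Se.P + (1 - P)(1 - Sp))."""
    return se * p / (se * p + (1.0 - p) * (1.0 - sp))


prevalence = 0.5  # assumed value, for illustration only
pp_low = positive_predictive_value(se=0.92, sp=0.60, p=prevalence)   # threshold SS = 0
pp_high = positive_predictive_value(se=0.74, sp=0.93, p=prevalence)  # threshold SS = 10
print(round(pp_low, 3), round(pp_high, 3))
```

Under this assumed prevalence the higher threshold gives the higher positive predictive value, which is the behaviour the measure must show to be diagnostically quantitative.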

However, if one defines the values Sp and Se for increasing
values of SS (see above), then the measure is diagnostically
quantitative if the PP does increase with increasing values
of SS. Alternatively, the sensitivity must decrease or stay
constant, and the specificity must increase (if the latter)
or increase or remain constant (if the former).

The test population consists of a cohort of serial patients
stratified for risk of coronary artery disease. The
stratification, as described in the work of Diamond and
Forrester (3), is based on the patient's age and sex, the nature
of the symptoms, and the degree of ST segment depression.
Furthermore, the patients are classified as well-tested if the
end-point of the stress test was reached (either 85% of maximum
predicted heart rate, ST segment depression, blood pressure drop
or significant arrhythmia).

The patients are grouped according to this classification in 9
groups, according to the disease prevalences.

The prevalence of the symptoms [P(S)] in any population can be
expressed as the weighted sum of the prevalence in those who
have, and those who do not have, the disease:

P(S) = Se.P + (1.-P).(1.-Sp)

This equation can be rearranged to read:

P(S) = P.[Se-(1.-Sp)] + (1.-Sp)

If one sets P(S) = y; P = x; [Se-(1.-Sp)] = a; and (1.-Sp) = b,
then we obtain the regression equation:

y = a.x + b.

For  each  of  the  9  groups  one 
defines  y  as  the  frequency  of 
positive  outcomes. 

The  sensitivity  is  given  by  (  a 
+  b  )  and  the  non-specificity  by 
b,  the  coefficients  from  the 
regression  analysis  of  the 
paired  observations  x  and  y. 
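A minimal sketch of this estimation step follows, using an ordinary least-squares line fitted in plain Python. The nine prevalences and the "true" Se and Sp used to generate the positive rates are hypothetical; they only serve to show that Se = a + b and Sp = 1 - b are recovered from the fitted coefficients.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def sensitivity_specificity(prevalences, positive_rates):
    """Se = a + b and Sp = 1 - b from the regression of y on x."""
    a, b = fit_line(prevalences, positive_rates)
    return a + b, 1.0 - b


# Hypothetical data: 9 groups with known prevalence x, and positive rates y
# generated without noise from Se = 0.9 and Sp = 0.8.
x = [0.1 * k for k in range(1, 10)]
y = [0.9 * p + (1.0 - p) * (1.0 - 0.8) for p in x]

se_hat, sp_hat = sensitivity_specificity(x, y)
print(round(se_hat, 6), round(sp_hat, 6))
```

With real data the observed frequencies are noisy, so the fitted values are estimates rather than exact recoveries.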

3.  RESULTS  AND  DISCUSSION 

The results on a cohort of 135 cases confirm the hypothesis.
With SS = 0., Se = 0.92 and Sp = 0.60. When SS = 10., Se = 0.74
and Sp = 0.93. The linear relationship measured in the
regression analysis is maintained until Sp = 1.00.

At that point one would expect that higher values would detect
more advanced disease, but since the analysis is based on
relative distributions, local abnormalities tend to be less
visible in the presence of global disease. Homomorphy is present
for the probability of disease only. Unless algorithms can be
devised to match images by elastic deformation of the coordinate
systems, one needs to achieve quantitative image analysis
following a transformation which eliminates size and shape
information.

More importantly, however, numerical extraction becomes
quantitative only if higher deviations from the norm have
higher positive predictive values.

The purpose of this article is to provide statisticians with a
brief introduction to parallel computing. We begin by discussing a
few basic notions which are fundamental to parallel processing.
Next some important aspects of hardware for parallel
computers are reviewed. We then provide a brief analysis
of system performance, including a statistical approach to the
performance of one kind of distributed computing system.
Next there is a discussion of a particular form of parallel
iteration which we have found generally useful, followed by
a discussion of several statistical applications. We conclude
with a review of some of the difficulties of programming
parallel systems and mention one programming system we have
used which helps overcome these problems. We recommend
Bertsekas and Tsitsiklis (1989) for the reader who is interested
in further details on many of these topics.

1  Fundamentals 

There are a small number of key issues that are necessary for
the understanding of parallel computing:

1. Sequential processes,

2. Synchronization of parallel processes, and

3. Interprocess communication.

1.1  Processes 

A fundamental notion required for the understanding of parallel
processing is the notion of a sequential process. A sequential
process is the actual execution of a sequential program.
A sequential program specifies the sequential execution of a
list of program statements. Note that here we are using the
terms process and program in a slightly different way than is
customary. For our purposes here a program (and the process
executing it) might only consist of a small number of
instructions or even a single instruction.

*This work was partially supported by ONR Contract N00014-91-J-1024

A  parallel  program  specifies  the  execution  of  one  or  more 
sequential  programs  that  can  be  executed  as  parallel  processes. 
Parallel  processes  can  be  implemented  in  one  of  three  ways. 

•  Multiprogramming: The processes can execute on a single
processor from a single memory.

•  Multiprocessing: The processes can execute on distinct
processors but share a common memory.

•  Distributed processing: The processes can execute on
distinct processors each with its own memory.

Multiprogramming is merely the traditional time-shared
version of “parallel” processing; each user seems to have a
private machine but is, in fact, sharing a single machine with
many others.

The  execution  path  of  any  program,  parallel  or  not,  can  be 
represented  by  an  acyclic  directed  graph  called  the  process 
flow  graph.  Each  node  in  the  graph  represents  a  process  and 
each  directed  arc  represents  a  dependency  in  the  calculation. 
An  arc  from  node  i  to  node  j  means  that  the  process  at  node 
j  requires  the  result  of  the  process  at  node  i. 
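As an illustration, a process flow graph can be represented as a mapping from each process to the processes whose results it requires. The sketch below uses the Python standard library class `graphlib.TopologicalSorter` (Python 3.9 or later) to group processes into stages that could run in parallel; the process names and the helper `execution_stages` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a process; its value is the set of processes whose results it needs.
# An arc from i to j in the flow graph appears here as i being in the set of j.
flow_graph = {
    "read": set(),
    "sum_x": {"read"},
    "sum_x2": {"read"},
    "variance": {"sum_x", "sum_x2"},
}


def execution_stages(graph):
    """Group processes into stages; processes within a stage do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(flow_graph)
print(stages)
```

Here `sum_x` and `sum_x2` land in the same stage because neither requires the result of the other.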

1.2  Synchronization 

Synchronization of parallel processes is required in order to
properly execute the process flow graph of a computation.
This controls the cooperation/interference of one process with
another.

Consider, as an example, the standard Jacobi iteration for
the solution x of a homogeneous linear system

    x^{(k+1)} = A x^{(k)}

where A is a square matrix. If there is one processor for each
row of A then all processors must complete the calculation of
the dot product $a_j x^{(k)}$ (where $a_j$ is the jth row of A)
before any processor can start the next iteration. The processors
must be synchronized at the end of each iteration.
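The need for synchronization at the end of each iteration can be sketched with Python threads standing in for processors. This is an illustration only: it uses `threading.Barrier` from the standard library, one thread per row of A, and a hypothetical function name `parallel_iterate`; it is not the authors' implementation.

```python
import threading


def parallel_iterate(A, x0, iterations):
    """Compute x <- A x repeatedly, with one thread per row of A."""
    n = len(A)
    current = list(x0)
    nxt = [0.0] * n
    barrier = threading.Barrier(n)

    def worker(j):
        nonlocal current, nxt
        for _ in range(iterations):
            # Dot product of row j with the current iterate.
            nxt[j] = sum(a * x for a, x in zip(A[j], current))
            # No thread may proceed until every row has been computed.
            if barrier.wait() == 0:
                # Exactly one thread swaps the two buffers.
                current, nxt = nxt, current
            # Wait again so that all threads see the swapped buffers.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(j,)) for j in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return current


A = [[0.5, 0.25], [0.25, 0.5]]
result = parallel_iterate(A, [1.0, 2.0], iterations=3)
print(result)
```

Without the barrier, a fast thread could overwrite or read values belonging to a different iteration than the slower threads are working on.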

There are several standard programming techniques for
implementing process synchronization:

1. Shared variables,

2. Semaphores, and

3. Data flow.

A more detailed discussion of synchronization with programs
implementing some of the methods is given in Eddy (1986).

1.3  Communication

Communication among the processes in a parallel program is
handled either by the use of common memory (in a shared
memory system) or by the use of a communication network
(in a distributed memory system). Typically, if the system
uses shared memory then communication is handled through
shared variables stored in the common memory. Those variables
are usually accessed with hardware instructions such as
“load and lock” and “store and unlock.” In a distributed
memory system communication is handled by means of message
passing.

One important aspect of communication is whether or not
the communications are synchronized. Synchronization of
interprocess communications in a shared memory environment
is handled through the use of the memory locking mechanism
indicated above. Synchronization in a distributed memory
environment is typically handled through the use of I/O
instructions which “wait until completion.”

A further complication is that synchronization can be different
at each end of the communication channel and at different
times. Each time an actual read or write is issued, it can
include an implied “wait until completion” or not. In the case
of asynchronous writes, one problem is that a large number
of writes may be issued without a corresponding number of
reads. Consequently, the receiving process must have available
a nearly unlimited amount of buffer space to store these
messages until it is prepared to read them.


2  Hardware

2.1  Flynn’s Taxonomy

Flynn (1966) introduced terminology for models of computation
which has become standard, although it is unfortunately
imprecise. Flynn’s scheme is a cross-classification of hardware
on the basis of two attributes:

1. whether the machine can process more than one instruction
simultaneously;

2. whether the machine can process more than one data
item simultaneously.

The four resulting types of machines are:

                        DATA
INSTRUCTION      Single      Multiple
Single           SISD        SIMD
Multiple         MISD        MIMD

The SISD machines are traditional sequential computers,
often called von Neumann machines after their designer. The
SIMD machines occur in a number of varieties, the most
important being the vector processors such as the Cray and
the systolic machines. There are really no practical MISD
machines. The MIMD machines occur in two varieties, both
of which are very important:

1. the distributed memory machines such as the hypercubes
and other networks of processors with local memory, and

2. the shared memory machines such as the multiprocessor
VAXes and the Cray XMP and YMP machines.

2.2  Memory  Hierarchy 

An important part of the design of parallel computers is
controlling the flow of data to and from the various components
of storage. We naturally think of a hierarchy relating speed
of access and capacity of these storage elements. Similarly, a
critical problem in the development of parallel algorithms for
a particular hardware environment is the placement of data
items at various levels in the memory hierarchy. While the
precise elements of the hierarchy can vary substantially from
machine to machine, a fairly general list of the available levels
includes

1.  CPU  registers, 

2.  cache  memory, 

3.  local  memory, 

4.  distributed  memory, 

5.  disk  memory,  and 

6.  off-line  storage. 

As one proceeds down the hierarchy, data items take an ever
greater amount of time to access but are simultaneously
available in greater quantity. Thus, for example, the most
rapidly accessible items are those stored in the CPU registers
but there is a very limited number of them. On the other hand,
data stored off-line is only accessible after a considerable wait
but there is an essentially unlimited amount of such storage
available.

2.3  Examples

In the actual presentation at the meeting a variety of different
hardware systems were described. These included

•  bus-connected systems,

•  cross-bar switch based systems,

•  mesh-connected systems,

•  shuffle-exchange networks,

•  hypercube connections, etc.

Space limitations preclude that discussion here. The interested
reader might consult Dongarra et al. (1991) for similar
examples or Duncan (1990) for a more technical survey.


3  Performance 

3.1  Amdahl’s  Law 

Amdahl’s law concerns the basic fact that not all parts of
a calculation are equally amenable to processing in parallel.
Assume that the execution of a program requires M total work
and that a certain fraction f of this work can be done at a speed
of $S^{-1}$ (sequential) and the remainder can be done at a speed
of $P^{-1}$ (parallel). The total time T required to complete the
program is

    T = f \cdot M \cdot S + (1 - f) \cdot M \cdot P.

Consequently, the effective speed R with which the program
is executed is given by

    R = M/T = \frac{1}{f \cdot S + (1 - f) \cdot P};

this is Amdahl’s law. The critical implication of Amdahl’s
law is that the time required for the execution of any parallel
program is bounded below by the time required to execute the
sequential portion of that program even if the parallel portion
is executed infinitely fast. That is,

    T \ge f \cdot M \cdot S.
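A short numerical sketch of these formulas follows. The values of f, S and P are hypothetical; S and P are times per unit of work, so the corresponding speeds are their reciprocals.

```python
def total_time(M, f, S, P):
    """T = f*M*S + (1 - f)*M*P."""
    return f * M * S + (1.0 - f) * M * P


def effective_speed(f, S, P):
    """R = M/T = 1 / (f*S + (1 - f)*P)."""
    return 1.0 / (f * S + (1.0 - f) * P)


f, S = 0.1, 1.0  # 10% of the work is sequential (assumed values)
for P in (1.0, 1.0 / 16.0, 1.0 / 1024.0, 0.0):
    print("P =", P, " R =", effective_speed(f, S, P))

# Even with P = 0 (infinitely fast parallel part) the speed cannot exceed 1/(f*S).
speed_limit = 1.0 / (f * S)
```

With these values the effective speed approaches, but never exceeds, 10 times the sequential speed, however fast the parallel portion runs.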


3.2  Load  Balancing 

There  are  two  possible  views  of  performance  in  a  parallel 
system.  One  view  (that  of  an  individual  user)  desires  to 
complete  a  single  job  in  the  shortest  possible  time  (minimizing 
the  makespan).  The  other  view  (that  of  a  system  manager) 
desires  to  keep  all  the  processors  of  the  system  as  busy  as 
possible. 


4  Statistical Issues

There are some statistical issues that arise in the study of
distributed system performance. These relate to the processors
and communication channels often being in use by other
applications besides the one of interest, as well as other
unpredictable features of the hardware and software. There are
two natural ways to study the stochastic performance of
distributed systems. One is to create statistical models for the
performance of the system and the other is to collect data on
the actual performance of a system.


4.1  Statistical  Models 

Suppose that we wish to minimize the expected time to
completion of a task (the makespan). We need to divide the work
among the processors in some optimal fashion.

4.1.1  A  No-Cost  Model 

Consider the following simple model to start. There is a task
of “size” 1 unit and there are p processors among which we
can divide the task. Suppose that the task can be divided into
smaller subtasks whose sizes add to 1 in such a way that the
time it takes to complete a subtask of size $\beta$ is an
exponential random variable with mean $\beta$. Suppose that
communication is fast enough that there is no time lost between
subtasks. Suppose the task is divided into $n \ge p$ equal
subtasks of size $\beta = 1/n$, and one subtask is assigned to
each processor. Each time a processor completes a subtask,
another one is assigned until the computation completes. Under
these assumptions, the makespan can be calculated as follows:

    \frac{1}{p} + \frac{1}{n} \sum_{i=2}^{p} \frac{1}{i} \approx \frac{1}{p} + \frac{\ln p}{n}.    (1)

This is minimized if n is chosen as large as possible. This is
clearly nonsense in any real application.
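Expression (1) can be checked by adding up the mean times between successive completions. The sketch below does this with exact rational arithmetic; the function names are hypothetical, and the step-by-step accounting (all p processors busy until n - p + 1 subtasks have finished, then p - 1, ..., 1 processors busy) is our reading of the model rather than a derivation given in the paper.

```python
from fractions import Fraction


def makespan_equal_subtasks(n, p):
    """Expected makespan for n equal subtasks (mean 1/n each) on p processors, n >= p."""
    beta = Fraction(1, n)
    # While all p processors are busy, completions occur at rate p / beta.
    # This phase lasts until the (n - p + 1)-th completion.
    total = (n - p + 1) * beta / p
    # Afterwards i = p-1, ..., 1 subtasks are still running: mean gap beta / i.
    total += sum(beta / i for i in range(1, p))
    return total


def closed_form(n, p):
    """1/p + (1/n) * sum_{i=2}^{p} 1/i, the exact part of expression (1)."""
    return Fraction(1, p) + Fraction(1, n) * sum(Fraction(1, i) for i in range(2, p + 1))


print(makespan_equal_subtasks(3, 2))   # two processors, three equal subtasks
print([float(makespan_equal_subtasks(n, 4)) for n in (4, 8, 64, 1024)])
```

The first value is 2/3, and the second line shows the makespan decreasing towards 1/p as n grows.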

Consider  next,  the  case  where  the  number  of  subtasks  n 
is  fixed  and  suppose  we  are  interested  in  choosing  which  n 
parts  of  the  task  to  make  into  the  n  subtasks.  For  example,  if 
we  have  two  processors  and  we  can  only  divide  the  task  into 
three  subtasks,  we  can  still  choose  the  sizes  of  the  subtasks. 
Retain,  for  now,  the  zero  cost  assumption.  If  we  make  all  three 
subtasks  the  same  size,  then  the  makespan  is  2/3  from  (1).  If 
instead,  we  make  the  sizes  of  the  three  subtasks  1/a,,  i  = 
1,2,3,  we  can  still  calculate  the  makespan.  The  completion 
times  for  the  messages  are  exponential  random  variables  with 
natural  parameters  a,.  It  follows  that  the  makespan  is 


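$$\frac{1}{\alpha_1+\alpha_2} + \frac{\alpha_1}{\alpha_1+\alpha_2}\left[\frac{1}{\alpha_2}+\frac{1}{\alpha_3}-\frac{1}{\alpha_2+\alpha_3}\right] + \frac{\alpha_2}{\alpha_1+\alpha_2}\left[\frac{1}{\alpha_1}+\frac{1}{\alpha_3}-\frac{1}{\alpha_1+\alpha_3}\right]. \qquad (2)$$

For each value of $\alpha_3$, (2) is minimized at $\alpha_1 = \alpha_2 = \alpha$, say.
In this case, the makespan is $(2\alpha^2 - 5\alpha + 5)/(2\alpha^2 - 2\alpha)$,
which, in turn, is minimized at

$$\alpha = \frac{5 + \sqrt{10}}{3} \approx 2.721.$$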


The minimum makespan is then $\sqrt{10} - 5/2 \approx 0.6623$, which
is not a great improvement over the equal subtask solution
($\approx 0.6667$).
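The two-processor, three-subtask calculation can be checked numerically. The sketch below assumes subtasks 1 and 2 start first and subtask 3 starts when either finishes. It uses the fact that the expected maximum of two independent exponentials with rates $r$ and $s$ is $1/r + 1/s - 1/(r+s)$. The function names are illustrative only.

```python
import math


def expected_max(r, s):
    """E[max(X, Y)] for independent exponentials with rates r and s."""
    return 1.0 / r + 1.0 / s - 1.0 / (r + s)


def makespan3(a1, a2, a3):
    """Expected makespan on two processors for subtasks with rates a1, a2, a3,
    where subtasks 1 and 2 start first and subtask 3 starts at the first finish."""
    first_finish = 1.0 / (a1 + a2)
    prob_1_first = a1 / (a1 + a2)
    return (first_finish
            + prob_1_first * expected_max(a2, a3)
            + (1.0 - prob_1_first) * expected_max(a1, a3))


def symmetric_makespan(a):
    """Makespan when subtasks 1 and 2 both have size 1/a (a > 2);
    the third subtask takes the remaining size 1 - 2/a, i.e. rate a/(a-2)."""
    return makespan3(a, a, a / (a - 2.0))


if __name__ == "__main__":
    a_opt = (5.0 + math.sqrt(10.0)) / 3.0
    print(makespan3(3, 3, 3), a_opt, symmetric_makespan(a_opt))
```

Evaluating `symmetric_makespan` on a grid of values of `a` greater than 2 is an easy way to see how flat the makespan is near its minimum.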

In  general,  the  optimal  division  into  n  subtasks  will  have 
makespan  less  than  the  division  into  n  equal  subtasks  as  in 
(1). With exponential distributions, as $n \to \infty$, the makespan
converges to $1/p$, the minimum possible value. Consequently,
for  these  cases  we  believe  that  division  into  optimal  size  sub¬ 
tasks  will  not  substantially  improve  upon  the  simpler  division 
into  equal  subtasks. 


4.1.2  A  Random  Cost  Model 


We could assume that each subtask carries an overhead $c$
such that the time to complete each subtask of size $\beta$ is an
exponential random variable with mean $c + \beta$. For simplicity,
we will assume that all subtasks are the same size, so that,
if there are $n$ subtasks, $\beta = 1/n$. Suppose that there are
$p$ processors. The first subtask to finish takes time equal
to the minimum of $p$ exponential random variables, so its
distribution is $\exp(p/[c + \beta])$. The memoryless property of
exponential distributions gives that the time between when
the $(i-1)$st and $i$th subtasks finish also has $\exp(p/[c + \beta])$
distribution for $i = 2, \ldots, n - p + 1$. For $i = 1, \ldots, p - 1$, the
time after subtask $n - i$ finishes until subtask $n - i + 1$ finishes
has $\exp(i/[c + \beta])$ distribution. The expected makespan is then

$$(c + \beta)\left[\frac{n - p + 1}{p} + \sum_{i=1}^{p-1}\frac{1}{i}\right].$$

This is minimized at $n = \sqrt{p A(p)/c}$, where $A(p) = \sum_{i=2}^{p} 1/i$.

Of  course,  the  mcmorylcss  property  of  the  exponential  dis¬ 
tribution  makes  it  an  implausible  model  for  running  times  of 
fixed  subtasks.  Other  factors,  such  as  network  traffic  and  pre¬ 
dictable  patterns  of  usage  in  a  distributed  system  also  make 
the  assumptions  of  this  model  seem  unrealistic. 
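The trade-off in the random cost model can be explored numerically. The sketch below assumes equal subtasks of size $1/n$, a per-subtask overhead `c`, and the expected makespan $(c + 1/n)[(n-p+1)/p + \sum_{i=1}^{p-1} 1/i]$. Setting the derivative in $n$ to zero gives $n = \sqrt{pA(p)/c}$ with $A(p) = \sum_{i=2}^{p} 1/i$. The function names are ours.

```python
import math


def makespan_with_overhead(n, p, c):
    """Expected makespan with n equal subtasks, p processors, overhead c."""
    mean_time = c + 1.0 / n          # mean running time of one subtask
    return mean_time * ((n - p + 1) / p + sum(1.0 / i for i in range(1, p)))


def harmonic_tail(p):
    """A(p) = 1/2 + 1/3 + ... + 1/p."""
    return sum(1.0 / i for i in range(2, p + 1))


def best_n(p, c):
    """Integer number of subtasks (at least p) minimizing the expected makespan."""
    n_star = math.sqrt(p * harmonic_tail(p) / c)   # continuous minimizer
    candidates = {max(p, math.floor(n_star)), max(p, math.ceil(n_star))}
    return min(candidates, key=lambda n: makespan_with_overhead(n, p, c))


if __name__ == "__main__":
    for c in (0.01, 0.001, 0.0001):
        n = best_n(8, c)
        print(c, n, makespan_with_overhead(n, 8, c))
```

Because the expected makespan is convex in $n$, checking the two integers on either side of the continuous minimizer is enough.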

4.2  Empirical  Study 

If  one  is  concerned  about  the  performance  of  a  distributed 
system,  but  is  not  confident  in  any  of  the  simple  statistical 
models  which  one  can  construct  for  their  performance,  one 


can  collect  data  on  the  performance  of  the  system  by  giving 
it  various  problems  to  solve  which  are  similar  to  those  for 
which  one  will  want  to  use  the  system  in  the  future.  One 
can  vary  the  size  and  nature  of  the  problem,  the  number  and 
types  of  processors,  and  the  sizes  of  subtasks.  As  an  example, 
Eddy  and  Schervish  (1986)  report  on  a  case  for  which  the 
application  could  be  made  arbitrarily  large  and  for  which  the 
subtasks  could  be  made  very  small.  Two  different  configu¬ 
rations  of  distributed  system  were  used,  one  containing  eight 
processors  and  the  other  15  processors.  (The  systems  were 
heterogeneous  in  that  there  were  three  different  kinds  of  CPU 
represented  among  the  15  nodes.)  Figure  1  is  a  plot  of  the 
times  to  completion  of  many  runs  using  these  two  systems  vs. 
the  natural  logarithm  of  the  number  of  subtasks  (messages). 
We see that the time to completion is relatively insensitive to
the number of subtasks within a certain range, but when the
number of subtasks gets very large, communication bottlenecks
cause inefficiencies. When the number of subtasks gets
too small, excessive time is lost waiting for the last subtask to
complete.


5  Asynchronous  Iteration 

Consider  an  iterative  method  in  which  each  iteration  is  a  sub¬ 
stantial  computation.  A  typical  example  is  the  solution  of 
a  fixed  point  problem  by  successive  substitution,  where  each 
evaluation  of  the  function  is  time  consuming.  If  the  evaluation 
of  the  function  can  be  broken  into  subtasks,  each  iteration  can 
run  on  a  distributed  system.  Since  the  subtasks  are  performed 
asynchronously,  these  methods  are  called  asynchronous  iter¬ 
ation.  A  theoretical  problem  arises  concerning  the  conver¬ 
gence  of  such  an  iterative  method.  Each  “iteration”  of  such 
an  asynchronous  algorithm  is  not  the  same  as  an  iteration 
of  the  corresponding  synchronous  algorithm.  For  example, 
consider the following iterative method for finding the largest
eigenvalue and corresponding eigenvector of a large square
matrix $A$. Let $x_0$ be a non-zero starting vector, and let $c_0$
be the absolute value of the largest coordinate of $x_0$. For
$n = 0, 1, 2, \ldots$, define

$$x_{n+1} = \frac{1}{c_n} A x_n, \qquad (3)$$

where $c_n$ is the absolute value of the largest coordinate of $x_n$.

If $x_{n+1} = x_n$, then that vector is an eigenvector of $A$ corresponding
to the largest eigenvalue, $c_n$. Each iteration of the
usual synchronous algorithm for successive substitution consists
of multiplying the result of the previous iteration by the
matrix $A$ and rescaling as in (3). If $A$ is large, it might make
sense to have several processors doing different parts of the
multiplication at the same time. For example, suppose $A$ is
$m \times m$ and $m = m_1 + m_2$. We might let one processor calculate
$A_1 x_n$ and let another calculate $A_2 x_n$, where $A_1$ is the first
$m_1$ rows of $A$ and $A_2$ is the last $m_2$ rows. Similarly, we might
split $A$ into $p$ disjoint sets of $m_1, \ldots, m_p$ rows and use $p$
processors. Alternatively, we can split $A$ into $k$ disjoint
(or even overlapping) sets of $m_1, \ldots, m_k$ rows and use $p$
processors with $p < k$. Now the question naturally arises as to
whether we should wait until all $k$ partial iterations are complete
before forming $x_{n+1}$ or should form a new $x_{n+1}$ every time that
we learn some of the new coordinates. The first scheme produces
what are known as Jacobi iterations, while the second
produces Gauss-Seidel iterations. With Jacobi iterations, the
$x_{n+1}$ which results after all $k$ partial iterations are complete is
the same as what would be produced if the entire multiplication
were done at once. In the case of Gauss-Seidel iterations,
precisely which vector $x$ gets multiplied by some subset of the
rows of $A$ at a particular partial iteration depends on which
coordinates have been updated by the time that partial iteration
begins.
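The difference between the two schemes can be seen in a small sequential sketch. It assumes the rows are split into disjoint blocks and omits the rescaling by $c_n$. The 3 x 3 matrix and the function names are made up for illustration.

```python
def rows_times_vector(A, x, rows):
    """Multiply the selected rows of A by x; returns {row index: value}."""
    return {r: sum(A[r][c] * x[c] for c in range(len(x))) for r in rows}


def jacobi_sweep(A, x, blocks):
    """Every block multiplies the same old vector x (Jacobi)."""
    new = list(x)
    for rows in blocks:
        for r, value in rows_times_vector(A, x, rows).items():
            new[r] = value
    return new


def gauss_seidel_sweep(A, x, blocks):
    """Each block multiplies the most recently updated vector (Gauss-Seidel)."""
    current = list(x)
    for rows in blocks:
        for r, value in rows_times_vector(A, current, rows).items():
            current[r] = value
    return current


if __name__ == "__main__":
    A = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.5, 0.5]]
    x0 = [1.0, 0.0, 0.0]
    blocks = [[0], [1], [2]]
    print(jacobi_sweep(A, x0, blocks))        # same as the full product A x0
    print(gauss_seidel_sweep(A, x0, blocks))  # uses coordinates as they arrive
```

In a truly asynchronous run the order in which blocks finish is not fixed in advance, so the Gauss-Seidel result also depends on processor speeds.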

As a simple illustration, suppose we split $A$ into $k = 3$
disjoint sets of rows $A_1, A_2, A_3$ and we have $p = 2$ processors.
(For convenience, suppose that we know that the largest
eigenvalue is 1 so that we don't have to divide by the $c_n$ values.)
Let $x_0^{(i)}$ denote the coordinates of the starting vector
$x_0$ which correspond to the rows in $A_i$. Suppose processor
$i$ is assigned the subtask of multiplying $A_i x_0$ for $i = 1, 2$.
Now suppose processor 2 finishes its multiplication first. Let
$x_1^{(2)}$ stand for the $m_2$ coordinates returned. For a Jacobi
iteration scheme, we would now assign processor 2 the subtask of
multiplying $A_3 x_0$. For Gauss-Seidel iterations, we construct
$x_1$ by combining $x_0^{(1)}$, $x_1^{(2)}$, and $x_0^{(3)}$. Processor 2 is then
assigned the subtask of multiplying $A_3 x_1$. These two subtasks
will not produce the same output. For Gauss-Seidel iterations,
we can assign subtasks in a simple cyclic fashion 1, 2, 3, 1,
2, 3, ... until some convergence criterion is met. Each time
a new subtask is assigned, the vector $x_n$ consists of the most
recently updated values for all coordinates. When processors
have widely differing speeds, one needs to be careful to keep
track of how old a subtask is before updating coordinates. For
example, suppose that processor 2 finishes its second subtask
before processor 1 finishes its first subtask. Let $A_3 x_1 = x_2^{(3)}$.
Using the cyclic assignment scheme, we would construct $x_2$
out of $x_0^{(1)}$, $x_1^{(2)}$, and $x_2^{(3)}$ and then assign processor 2 the
subtask of multiplying $A_1 x_2$. Then if processor 2 finishes its
third subtask before processor 1 finishes its first, we would
have a value $x_3^{(1)} = A_1 x_2$ which would supersede the result
of processor 1, namely $A_1 x_0$, when processor 1 finally
completes.

There are conditions (see Baudet 1975, for example) under
which Gauss-Seidel iterations of the asynchronous type described
above converge. For example, let $F = (F_1, \ldots, F_n)$
be an $n$-dimensional function of $n$ variables, and suppose
that we are seeking a fixed point of $F$. For each $j$, let
$x^j = (x_1^j, \ldots, x_n^j)$ be an $n$-dimensional vector representing
the $j$th iterate of an asynchronous algorithm. Let $L_j$
represent the set of subscripts $i$ (elements of $\{1, \ldots, n\}$) such
that $F_i$ will be calculated during the $j$th iteration. To calculate
each $F_i$, we need to choose an $n$-dimensional vector $x$ as its
argument. Let $s_i(j)$ be the index of the iterate from which the
$i$th coordinate of $x$ will be drawn to be used in the arguments of
the functions calculated in the $j$th iteration. To summarize, the
iterates are calculated as

$$x_i^j = \begin{cases} F_i\left(x_1^{s_1(j)}, \ldots, x_n^{s_n(j)}\right) & \text{if } i \in L_j, \\ x_i^{j-1} & \text{if } i \notin L_j. \end{cases} \qquad (4)$$

We need the following conditions:

1. $s_i(j) \le j - 1$ for all $j$ and $i$ in order to guarantee that the
scheme does not require future calculations to be done
before past calculations.

2. $\lim_{j \to \infty} s_i(j) = \infty$ for all $i$ in order to guarantee that
coordinates of the arguments to the $F_i$ functions get chosen
from newer iterations as time goes on.

3. For every $i$, $i \in L_j$ for infinitely many $j$ in order to
guarantee that every coordinate is updated infinitely often.

The  theorem  proven  by  Baudet  (1975)  is 

Theorem 1 Let $F : R^n \to R^n$ satisfy $|F(x) - F(y)| \le
A|x - y|$ for an $n \times n$ matrix $A$ with spectral radius less than 1,
where absolute values are to be understood coordinatewise.
Then, under the three conditions described above, the asynchronous
iteration scheme in (4) converges to the unique fixed
point of $F$.

Chazan and Miranker (1969) prove a similar theorem for linear
systems.

Theorem 2 If $F(x) = Ax + b$ with $b$ non-zero, and the three
conditions above hold, then the scheme in (4) converges if and
only if the spectral radius of $A$ is less than 1.
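A small simulation illustrates the linear case. The sketch below is only an illustration under assumptions of our own: one coordinate is updated per iteration in cyclic order, and each coordinate of the argument is read from an iterate that is at most `max_delay` steps old. This choice satisfies the three conditions. The matrix, vector and function name are made up.

```python
import random


def async_iterate(A, b, steps, max_delay=3, seed=0):
    """Asynchronous iteration for x = A x + b with bounded random staleness."""
    rng = random.Random(seed)
    n = len(b)
    history = [[0.0] * n]                     # history[j] holds iterate x^j
    for j in range(1, steps + 1):
        new = list(history[-1])               # coordinates not updated are kept
        i = (j - 1) % n                       # the single coordinate updated now
        argument = []
        for k in range(n):
            # Index of the (possibly stale) iterate supplying coordinate k;
            # it never exceeds j - 1 and grows without bound as j grows.
            s = max(0, j - 1 - rng.randint(0, max_delay))
            argument.append(history[s][k])
        new[i] = sum(A[i][k] * argument[k] for k in range(n)) + b[i]
        history.append(new)
    return history[-1]


if __name__ == "__main__":
    A = [[0.2, 0.1, 0.0],
         [0.1, 0.3, 0.1],
         [0.0, 0.2, 0.1]]
    b = [1.0, 2.0, 3.0]
    x = async_iterate(A, b, steps=600)
    residual = [x[i] - (sum(A[i][k] * x[k] for k in range(3)) + b[i]) for i in range(3)]
    print(x, residual)
```

The example matrix has row sums below 1, so its spectral radius is below 1 and the residual shrinks to rounding error despite the stale reads.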

For  more  general  discussion  of  parallel  matrix  compu¬ 
tations  the  interested  reader  should  consult  Gallivan  et  al. 
(1990) or Schendel (1984).

6  Statistical  Applications 

Several  statistical  applications  which  made  use  of  the  system 
of  Eddy  and  Schervish  (1986)  were  described  by  Schervish 
(1988).  One  application  involved  a  calculation  of  the  sum 
of  a  very  large  number  of  terms,  each  of  which  required 
only  a  small  amount  of  computation.  This  is  the  application 
whose running times are displayed in Figure 1. There were
38,266,040 terms in the sum, and it took a single MicroVAX
II computer 40,100 seconds to do the sum. The fifteen node
system took 4,303 seconds. This system consisted of eight
MicroVAX IIs, six MicroVAX Is and one VAX 11/750. This
system was estimated to have the computing power of ten
MicroVAX IIs, hence the reduction in running time by a factor
of .107 is quite good. Another benefit of the distributed
computation, compared to the serial computation, was numerical
accuracy.  The  computation  was  divided  into  6129  subtasks 
of  6237  terms  each.  Single  precision  distributed  computa¬ 
tion  agreed  with  double  precision  serial  computation  to  five 
decimal  places,  whereas  single  and  double  precision  serial 
computation  differed  by  as  much  as  5%. 
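One plausible mechanism for the accuracy gain is that each subtask accumulates a short partial sum, so single precision accumulators stay small relative to the terms being added. This is our assumption. The source reports the effect but not its cause. The sketch below emulates single precision with the standard `struct` module. The test data (one million copies of 0.1) and the block size are illustrative and are not the original computation.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Left-to-right summation with every intermediate rounded to single precision."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def blocked_sum_f32(values, block):
    """Sum each block separately (one 'subtask' per block), then sum the partials."""
    partials = [sum_f32(values[k:k + block]) for k in range(0, len(values), block)]
    return sum_f32(partials)


if __name__ == "__main__":
    terms = [0.1] * 1_000_000
    print("serial single precision :", sum_f32(terms))
    print("blocked single precision:", blocked_sum_f32(terms, 1000))
    print("double precision        :", sum(terms))
```

The blocked result stays close to the double precision value, while the serial single precision sum drifts once the accumulator is much larger than the terms.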

Kim  and  Schervish  (1988)  analyzed  survey  response  of 
9566  inmates  in  order  to  try  to  model  criminal  careers.  Due  to 
the  fact  that  the  inmates  were  in  jail  at  the  time  of  the  survey, 
the  sample  had  serious  recognizable  bias.  The  likelihood 
function  was  complicated  by  the  need  to  correct  the  bias. 
Also,  a  hierarchical  model  was  fit,  which  required  performing 
a  numerical  integration  for  each  inmate.  Each  evaluation  of 
the  likelihood  function  took  57.5  minutes  to  compute  on  a 
VAXstation  II.  The  application  was  distributed  by  dividing 
the  inmates  into  subtasks  of  size  100  each.  Every  time  a 
value  of  the  likelihood  function  was  needed,  the  computation 
was  distributed.  Each  evaluation  of  the  likelihood  took  7.1 
minutes on ten VAXstation IIs.

Not  all  applications  benefit  so  dramatically  from  the  dis¬ 
tributed  computation.  Schervish  and  Tsay  (1988)  developed 
multiprocess  models  for  time  series  which  allowed  for  abrupt 
changes  in  level  as  well  as  outliers  at  every  time  period.  As 
time  goes  on,  more  and  more  combinations  of  possible  outliers 
and  level  changes  needed  to  be  considered.  After  each  time 
period,  the  probabilities  of  the  60  combinations  which  seemed 
most likely were calculated and parameters were estimated for
each such combination. The 60 combinations were treated as 60
subtasks  and  distributed  after  each  of  20  time  periods.  It  took 
a  single  MicroVAX  II  1391  seconds,  and  it  took  a  system  of 
six  MicroVAX  IIs  360  seconds.  In  this  application,  there  is  a 
significant  amount  of  work  which  is  not  divided  amongst  the 
subtasks  and  this  takes  as  much  time  for  a  distributed  system 
as  for  a  single  processor.  Amdahl’s  law  strikes  again! 
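The reported times allow a rough back-of-envelope check. Under the idealized form of Amdahl's law, speedup $= 1/(f + (1-f)/p)$ with serial fraction $f$, identical processors and no communication cost. These are assumptions of the sketch, not claims from the source. The observed speedup on six machines then implies a serial fraction of about 11%.

```python
def speedup(t_serial, t_parallel):
    """Observed speedup from two wall-clock times."""
    return t_serial / t_parallel


def amdahl_speedup(serial_fraction, p):
    """Speedup predicted by Amdahl's law for p identical processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def implied_serial_fraction(observed_speedup, p):
    """Invert Amdahl's law: the serial fraction that yields the observed speedup."""
    return (p / observed_speedup - 1.0) / (p - 1.0)


if __name__ == "__main__":
    s = speedup(1391.0, 360.0)                 # times reported for the time series example
    f = implied_serial_fraction(s, 6)
    print(round(s, 2), round(f, 3), round(amdahl_speedup(f, 6), 2))
```

With that serial fraction, the same formula bounds the speedup at about 9 no matter how many processors are added.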

7  Programming 

7.1  Difficulties 

There  is  considerable  difficulty  attendant  to  writing  parallel 
programs.  The  most  formidable  obstacle  is  the  lack  of  fa¬ 
miliarity;  programmers  have  been  programming  sequential 
machines  for  decades  and  various  sequential  programming 
paradigms are well-known. The need for the programmer to
understand the issues related to synchronization and interprocess
communication makes parallel programming inherently
more complex than sequential programming. There is the
further difficulty that there are no standardized languages akin
to  Fortran,  Lisp,  Cobol,  C,  etc.  for  programming  parallel 
machines. 

A  significant  complicating  factor  is  that  unlike  a  program 
written  for  a  sequential  computer,  a  program  written  for  a 
parallel  computer  cannot  be  easily  “ported”  to  a  different  kind 
of  parallel  computer. 

7.2  Linda 

We have recently begun to use Linda for our parallel programming.
Linda is an extension to existing languages which
is based on a computational model assuming a shared memory
machine. The shared memory is addressed by an associative
scheme. The particular model is both simple and easily
implemented on a variety of real architectures and real programming
languages. Consequently, programs written in Linda are
portable without change across hardware environments. They
are not necessarily efficient in the various environments.

The actual implementations of Linda are handled as simple
extensions to existing languages such as Fortran and C. The
version that we have used is C-Linda for a distributed network
of processors. There are four extra functions added to the
usual C language for implementing the shared memory.

1.  in:  remove  data  from  the  shared  memory; 

2. out: add data to the shared memory;

3.  rd:  copy  data  from  the  shared  memory;  and 

4.  eval:  evaluate  and  add  data  to  the  shared  memory. 

The  key  to  the  parallel  execution  of  the  program  is  the  function 
eval.  If  one  of  its  arguments  is  itself  another  function  then 
that  function  is  actually  executed  in  a  separate  process  on  a 
distinct  processor. 

To  demonstrate  the  simplicity  of  C-Linda  programming, 
below  we  give  a  “parallel”  version  of  the  classic  “Hello  world” 
program.  The  only  features  that  deserve  further  explanation 
are

1. the name of the highest level routine

2. the operator "?"

3. the method by which entries in the content addressable
shared memory are accessed

#define NUMBER 30

real_main()
{
    int i, hello();

    out("number", 0);                 /* shared counter starts at 0 */
    for (i = 1; i <= NUMBER; i++)     /* start NUMBER copies of hello */
        eval(hello(i));
    in("number", NUMBER);             /* wait until every copy has run */
}

hello(i)
int i;
{
    int j;

    printf("Hello world; %d.\n", i);
    in("number", ?j);                 /* remove the counter, binding j */
    out("number", j + 1);             /* put back the incremented counter */
}

The  name  of  the  highest  level  routine  must  be 
real_main().

The  operator  “?”  is  used  for  selecting  any  item  from  the  shared 
memory  using  an  associative  scheme.  For  an  item  to  match 
the  argument  of  an  in  function  it  is  necessary  that  it  match 
both  in  type  and  in  content  if  it  is  specified  without  the  “?” 
operator.  Thus  the  in  in  the  main  program  is  not  matched  by  an 
element  in  the  shared  memory  until  the  subroutine  hello  has 
been executed NUMBER times. Also the in in the subroutine
hello  is  matched  by  any  element  in  shared  memory  which 
has  a  variable  of  type  int  for  its  second  entry. 
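For readers without access to C-Linda, the associative shared memory can be imitated with threads. The sketch below is a toy emulation, not the C-Linda implementation. The class `TupleSpace`, the method name `in_` (since `in` is a Python keyword) and the use of a Python type such as `int` in place of the "?" operator are all our own conventions.

```python
import threading


class TupleSpace:
    """Toy associative shared memory with out, in_, rd and eval operations."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        """Add a tuple to the shared memory and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _take(self, pattern, remove):
        """Block until a tuple matches the pattern; optionally remove it."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if len(tup) == len(pattern) and all(
                        isinstance(have, want) if isinstance(want, type) else want == have
                        for want, have in zip(pattern, tup)
                    ):
                        if remove:
                            self._tuples.remove(tup)
                        return tup
                self._cond.wait()

    def in_(self, *pattern):
        return self._take(pattern, remove=True)

    def rd(self, *pattern):
        return self._take(pattern, remove=False)

    def eval(self, func, *args):
        """Run func in a separate thread and add its result as a tuple."""
        worker = threading.Thread(target=lambda: self.out(func(*args)))
        worker.start()
        return worker


def hello(space, i):
    print(f"Hello world; {i}.")
    _, j = space.in_("number", int)     # matches any integer second entry
    space.out("number", j + 1)
    return i


def real_main(number=30):
    space = TupleSpace()
    space.out("number", 0)
    workers = [space.eval(hello, space, i) for i in range(1, number + 1)]
    space.in_("number", number)         # matches only once every hello has run
    for worker in workers:
        worker.join()
    return number


if __name__ == "__main__":
    real_main()
```

As in the C-Linda program, the order of the greetings is not determined, because the copies of `hello` run concurrently.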
Figure 1: Empirical Study of System Performance. Seconds vs. ln(number of messages).
Abstract

Models  concerned  with  the  spatial  features  of  epidemic 
spread  are  often  defined  in  terms  of  a  nearest-neighbour  grid 
network  (Mollison  &  Kuulasmaa  1983).  It  is  strongly 
conjectured,  and  can  be  proved  in  certain  cases  (eg  Cox  & 
Durrett  1988),  that  the  infected  area  has  (asymptotically)  a 
well-defined  shape. 

The  present  work  concerns  computer  analysis  of  the  shape  of 
spread  of  a  discrete-time  single-parameter  infection  process 
on  an  eight  neighbour  lattice.  Data  from  such  simulations 
can  be  fitted  with  a  particular  group  of  three  parameters 
which  reveal  features  of  the  shape  of  the  expanding 
epidemic.  These  three  parameters  are  discussed  in  relation  to 
the  effect  of  the  basic  model  form  upon  them,  with  a  view 
to  a  framework  for  making  a  priori  statements  about  the 
various  members  of  a  more  general  model  class.  Such  a 
framework  would  allow  choices  between  various  grid 
epidemic  models  to  be  made,  where  in  the  past  such  choices 
have  tended  to  be  decided  arbitrarily  by  other  (convenience) 
factors. 

Introduction 

This  paper  considers  a  class  of  models  for  stochastic 
diffusion.  Historically,  this  type  of  model  has  had 
applications in the modelling of epidemics and forest fires; it
can  be  conjectured  that  it  can  cover  other  spatial  modelling 
situations,  such  as  rusting  or  tumour  growth.  The  class  is 
sufficiently  broad  to  allow  many  interpretations  and  many 
choices  between  assumptions.  Currently,  little  is  known 
about  the  effects  of  the  specific  choice  of  model  within  the 
class;  this  paper  attempts  to  establish  a  quantification  of  the 
behaviour  of  one  model  which  can  be  generalized  to  others 
of  the  class,  facilitating  comparison. 

General  Model  Form 

Following  Richardson  (1973),  define  the  broad  class  as 
follows: 

Starting  with  n-dimensional  Euclidean  space  S ,  impose  a 


(partially-ordered) cell division T. Each cell can have one of
two  states,  referred  to  as  White  and  Black.  The  cell  which 
contains  the  origin  is  coloured  black  at  time  0  and  all  other 
cells  are  white. 

Next  we  impose  G,  a  stochastic  growth  process,  which 
tends  to  change  white  cells  with  black  neighbours  into  black 
cells. We consider C(t), the black shape at time t.

The  most  basic  decisions  in  model  selection  from  this  class 
are  (i)  continuous  versus  discrete  time  and  (ii)  the  neighbour 
structure.  For  example,  we  could  have  a  continuous  time 
model  where  each  cell  emits  germs  as  a  Poisson  process  and 
these  germs  land  on  neighbouring  cells  by  some  rule; 
alternately,  we  could  consider  discrete  time  where  each  black- 
white  neighbour  pair  becomes  a  black-black  pair  at  time  step 
t with probability p. Choices of neighbour structure include
4 or 8 neighbours for cells drawn around locations with
integral coordinates (the so-called Rook's Case and Queen's
Case models, the former being more common in the
literature). These structures are quite easy to simulate by
computer, where an array readily represents the model space,
but more complex tessellations can also be envisaged, such
as  6-neighbour  hexagonal  grids. 

Known  Results 

Some authors consider more complex model forms,
especially extra states for a cell: typically removal (Cox &
Durrett 1988), so that we have black, grey and white cells
with white→grey transitions by neighbours and grey→black
transitions purely by time. We can even consider regrowth
(black→white by neighbours, Mollison & Kuulasmaa 1985)
or recovery (without immunity). For some such models,
asymptotic existence results have been proven; as a typical
example, consider the Cox & Durrett Shape Theorem:

Recode white, grey, black as healthy, infected and immune.
Each infected site emits 'germs' by a Poisson process, rate
α, uniformly distributed to the 4 nearest neighbours.
Infected sites survive for a length of time following some
specified distribution. Define:
$C_0$ as the set of sites that will ever become infected from the
initial infective at the origin
$\xi_t$ as the set of immune sites at time t
$\zeta_t$ as the set of infected sites at time t

For a sufficiently well-behaved (finite second moment)
infected→immune process, and for α sufficiently large,

$\exists$ D, a convex set, s.t. $\forall\ \varepsilon > 0$,

(1) $P[\,C_0 \cap t(1-\varepsilon)D \subset \xi_t \subset t(1+\varepsilon)D \ \ \forall \text{ sufficiently large } t\,] = 1$

(2) $P[\,\zeta_t \subset t(1+\varepsilon)D - t(1-\varepsilon)D \ \ \forall \text{ sufficiently large } t\,] = 1$

Note  that  the  "well-behaved"  assumption  is  only  necessary 
for  (2).  The  process  considered  below  does  not  have  this 
property,  but  (1)  still  applies. 

This  result  is  typical  of  those  in  the  literature,  in  the  sense 
that  only  existence  is  proven,  and  only  in  an  asymptotic 
form (t → ∞). The shape of D is not known beyond
convexity  (and  obvious  symmetry),  and  the  form  of 
approach  to  this  "equilibrium"  shape  D  is  unknown. 
Specifically,  it  can  be  concluded  from  Durrett  &  Liggett’s 
(1981)  Flat  Edge  Result  that,  against  expectation,  D  cannot 
be  circular  for  many  cases. 

The  Process  Under  Consideration 

To  pursue  finite-time  results  for  such  models,  we  turn  to 
simulation  of  a  specific  process  as  follows: 

Time  is  to  be  discrete;  start  with  one  "ill"  individual  at  the 
origin  and  all  others  "well"  (as  usual)  on  the  lattice  of 
integer-valued coordinates in $R^2$. Use an 8-neighbour
(Queen’s  Case)  connection  lattice.  At  each  timestep,  every 
ill  individual  attempts  to  infect  (separately)  each  well 
neighbour  with  fixed  probability  p  of  success.  Iterate  this 
procedure  for  T  timesteps. 
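The process just described is short to simulate. The sketch below is a minimal version under one reading of the rules: infections attempted during a timestep take effect only at the end of that step, and a well site with several ill neighbours is attacked once by each of them. The function names and the dictionary representation of the surface are ours.

```python
import random

# The eight Queen's Case neighbours of a lattice site.
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def simulate(p, T, rng):
    """Run the discrete-time process for T timesteps; return the set of ill sites."""
    ill = {(0, 0)}
    for _ in range(T):
        newly_ill = set()
        for (x, y) in ill:
            for dx, dy in NEIGHBOURS:
                site = (x + dx, y + dy)
                # Each ill individual attacks each well neighbour separately.
                if site not in ill and rng.random() < p:
                    newly_ill.add(site)
        ill |= newly_ill
    return ill


def estimate_surface(p, T, runs, seed=0):
    """Estimate P[(i, j) infected by time T] as a mean over independent runs."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(runs):
        for site in simulate(p, T, rng):
            counts[site] = counts.get(site, 0) + 1
    return {site: c / runs for site, c in counts.items()}


if __name__ == "__main__":
    surface = estimate_surface(p=0.2, T=10, runs=200)
    for j in range(10, -11, -1):
        print(" ".join(f"{surface.get((i, j), 0.0):.1f}" for i in range(-10, 11)))
```

Sites that were never infected in any run are simply absent from the returned dictionary and can be read as probability 0.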

This  process  generates  observations  of  C(T)  (which  must  lie 
inside  [-T,T]x[-T,T]).  Re-code  each  element  of  C(T)  as  1  for 
ill  and  0  for  well,  and  take  the  mean  of  3000  such 
observations.  This  provides  a  collection  of  estimates 
S(i,j;p,T) ≈ P[(i,j) infected by time T]

(note: 3000 comes from the fact that a single proportion
estimate has SE $\sqrt{p(1-p)/n}$, which is maximized at p=0.5 and
has a value close to 0.01 for p=0.5, n=3000, so we obtain
<1% pointwise error)
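The standard error quoted in the note is the usual binomial one and is quick to check. The helper name below is ours.

```python
import math


def proportion_se(p, n):
    """Standard error of a proportion estimated from n independent 0/1 observations."""
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, round(proportion_se(p, 3000), 5))
```

The worst case is at a proportion of 0.5, where 3000 observations give a standard error of roughly 0.009.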

Investigating  the  Probability  Surfaces 

It  currently  takes  roughly  24  hours  of  processing  time  to 
generate  a  complete  set  of  surfaces  S(i,j;p,T)  for 


p=0.1(0.05)0.9 and T=5(5)60 - a total of 204 simulations.
This is small enough to make simulation an attractive tool
for investigation of the behaviour of this type of model.

Shown in Figures 1 and 2 are 3D plots of S(i,j;0.2,30) and
S(i,j;0.6,30). These surfaces exhibit (surprisingly?) large
areas for which S=1, and then a rapid drop to S=0. The
shape of this area can be shown to be non-circular; its nature
is most readily investigated with the aid of interactive
graphical software, such as Data Desk (Velleman 1990).
The usefulness of EDA software in probabilistic modelling
has, in the opinion of the author, yet to be fully appreciated.
This subject will be approached more comprehensively in
future  work. 

A  Functional  Form  for  S 

It is conjectured that the form of S(i,j;p,T) for this model is
well approximated by a combination of a logistic curve and
a measure of the distortion introduced by the grid; explicitly,
the suggested form is as follows:

$$S(i,j;p,T) = \frac{1}{\exp\{b(r_\alpha(i,j) - a)\} + 1}$$

where $r_\alpha(x,y) = \sqrt[\alpha]{|x|^\alpha + |y|^\alpha}$

Note that this form involves three parameters from a model
which initially has only two; any redundancy is not
currently clear. The parameters α, a and b are described
below.

α measures the distortion from the circular: α=2 gives a
circular S surface because $r_\alpha$ becomes Euclidean r, α=∞
gives a square S, and as α increases from 2 distortion
increases monotonically. This distortion is akin to the
cross-section of a balloon being inflated inside a box -
initially circular and eventually square.

a can be readily interpreted as the radius of the 0.5
probability contour of S in the metric generated by $r_\alpha$, since
when $r_\alpha = a$ we have S(i,j) = 1/(exp(0)+1) = 0.5.

b is a measure of the sharpness of the change from S=1 to
S=0; as b increases S becomes more like a step function
from 1 to 0 at $r_\alpha = a$.
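The conjectured surface is easy to evaluate directly, which helps in seeing what each parameter does. In the sketch below the function names are ours, and the limit α=∞ is handled as the maximum norm.

```python
import math


def r_alpha(x, y, alpha):
    """Distance from the origin in the alpha-norm; alpha = inf gives max(|x|, |y|)."""
    if math.isinf(alpha):
        return max(abs(x), abs(y))
    return (abs(x) ** alpha + abs(y) ** alpha) ** (1.0 / alpha)


def s_surface(i, j, a, alpha, b):
    """Logistic surface 1 / (exp(b (r_alpha(i, j) - a)) + 1)."""
    z = b * (r_alpha(i, j, alpha) - a)
    if z > 700.0:                      # exp would overflow; the value is 0 to machine precision
        return 0.0
    return 1.0 / (math.exp(z) + 1.0)


if __name__ == "__main__":
    # Along the diagonal the surface stays near 1 for longer as alpha grows
    # (compare the balloon-in-a-box picture).
    for alpha in (2.0, 4.0, math.inf):
        print(alpha, [round(s_surface(k, k, a=10.0, alpha=alpha, b=1.0), 3) for k in range(5, 12)])
```

Increasing `b` with `a` and `alpha` fixed sharpens the drop from 1 to 0 without moving the 0.5 contour.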

Results  from  Fitting 

Simulation sets have been collected as described
(p=0.1(0.05)0.9 and T=5(5)60) and a fitting procedure used
to find α, a and b for each data set. Currently, the method
used is to compute the sum of squared residuals between the
simulation output and the fitted surface at each of the grid
points, and a numerical minimization routine is applied to
the surface in $R^3$. It is by no means clear that this is the
ideal criterion for picking the α, a and b of best fit.
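The least-squares criterion can be sketched as follows. The source does not name its minimization routine, so the exhaustive search over a small parameter grid below is a stand-in chosen for transparency. The names `surface`, `ssr` and `grid_fit` and the synthetic test surface are ours.

```python
import math
from itertools import product


def surface(i, j, a, alpha, b):
    """Conjectured logistic surface evaluated at grid point (i, j)."""
    r = (abs(i) ** alpha + abs(j) ** alpha) ** (1.0 / alpha)
    z = b * (r - a)
    return 0.0 if z > 700.0 else 1.0 / (math.exp(z) + 1.0)


def ssr(observed, a, alpha, b):
    """Sum of squared residuals between observed values and the fitted surface."""
    return sum((s - surface(i, j, a, alpha, b)) ** 2 for (i, j), s in observed.items())


def grid_fit(observed, a_grid, alpha_grid, b_grid):
    """Return the (a, alpha, b) triple on the grid with the smallest SSR."""
    return min(product(a_grid, alpha_grid, b_grid), key=lambda t: ssr(observed, *t))


if __name__ == "__main__":
    T = 10
    truth = (6.0, 3.0, 1.0)
    observed = {(i, j): surface(i, j, *truth)
                for i in range(-T, T + 1) for j in range(-T, T + 1)}
    print(grid_fit(observed, [4.0, 5.0, 6.0, 7.0], [2.0, 3.0, 4.0], [0.5, 1.0, 2.0]))
```

With simulated rather than synthetic data the grid would be replaced by a proper optimizer started from the best grid point.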

A section of a scatterplot matrix of fitted values for α, a and
b against p and T is given in Figure 3. Alongside are shown
evaluations of some suggested closed-form approximations
for α, a and b which are proposed primarily on exploratory
grounds. The forms shown are computed as follows:

a  =  2+1.58 

Y=  1  +  1.41  (p-pc)  Ta  [(1-p^)  p,  ] 

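$- (p < p_c)\,(0.07(p_c - p) + 1.7(p_c - p)^2 + 3.35(p_c - p)^3)$

$b = e^{-0.28 - 4.0p + 15.5p^2 - 9.1p^3} + e^{-0.04T}$

or more succinctly

$\alpha = 2 + c_1 \ln(T)\, p^{c_2} q^{c_3}$

$a = T\{1 + c(p - p_c)T^{-\mathrm{function}(\exp(p))} - (p < p_c)\,\mathrm{cubic}(p)\}$

$b = e^{\mathrm{cubic}(p)} + e^{-cT}$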

from  which  we  observe  the  following: 

α behaves like 1/q (with q = 1 - p) so that we get the correct
limit of ∞ as p→1; it is not clear what the behaviour should be
as p→0 (it is unclear whether a single point - the origin - is
circular or square!), and it is hard to reliably compute α for very
small p as a very large number of simulations are required.
Note that there is a very small upturn in the simulation
output values at very small p which is not reflected in the
suggested approximation. Also, α appears to behave as
log(T), which confounds an earlier hypothesis in Lloyd
(1991) that α would be invariant with T.

a has a quite complex form, although essentially each term
is just a deviation from the first, which indicates that a ∝ T.
Certainly for p and T both sufficiently large, a ≈ T is a very
good approximation indeed ("large" here means roughly
p ≥ 0.5 and T ≥ 30). As T decreases, there is an additional term
of the form 1/T - the complicated exponent for T is just a
form that runs from $p_c$ to 1 as exp(p). Finally, for p below
a certain value, a increases linearly with T, but with a slope
less than 1. The final term accounts for this: $p_c$ is that
value of p for which (it is hypothesized that) a comes out as
exactly T for all T. Note that it may prove interesting to
relate this critical probability (currently estimated at roughly
0.51) to the many other critical probabilities to be found in
the literature for this type of problem. Above $p_c$, a must
asymptotically converge to T, but below it the limit is
f(p)T. The cubic form is a crude expression of f(p), and it is
very much hoped that a form with more meaning can soon
be found.

b splits quite simply into independent functions of p and T,
both  exponential  in  form.  Again,  a  cubic  for  p  fits  well  but 
can  hopefully  be  replaced  with  a  more  justifiable  function 
with  further  research. 

Finally  it  is  stressed  once  again  that  these  closed  forms  are 
exploratory  in  nature.  Once  similar  functions  are  available 
for  other  model  assumptions  (other  neighbour  and  time 
structures)  they  will  facilitate  direct  model  result 
comparison. Specifically, it is hoped that the α parameter
will provide a basis for discussion of the distortion
introduced into the model by the assumption of a grid; the
model form with lowest values for α would be the most
desirable  as  it  would  have  the  most  isotropic  behaviour. 
Figure 1: S(i,j; 0.2, 30)

Figure 2: S(i,j; 0.6, 30)

Figure 3: Observed (L) and Fitted (R) a, a and b
1 INTRODUCTION

In this preliminary study a generalized linear model is used to
describe the conditional probability of a tree being attacked by
mountain pine beetles in a given year, given the characteristics of
the tree (e.g., size) and the location of other attacked trees in the
stand. The model is used to analyze mountain pine beetle attack data
in two lodgepole pine stands in Oregon over a period of 10 years (see
Fig. 1 and 2). The data may be viewed as a realization of a spatial
point process with the probability of a tree being attacked dependent
on the status of other trees in the stand. Although full maximum
likelihood estimation is apparently not feasible, maximum
pseudolikelihood estimates of the parameters can be readily calculated
with standard statistical packages such as GLIM. The pseudolikelihood
function that is maximized is the product, over all trees, of the
conditional probabilities. This method of estimation was first proposed
by Besag (1975). See also Strauss and Ikeda (1990).


2  AN  AUTO-LOGISTIC  MODEL 


Let $Y_{ik}$ equal 1 if tree $i$ was attacked in year $k$ and
0 otherwise, where $i = 1, \dots, n_k$, $k = 1, \dots, K$ and
$n_k$ = number of trees that have not been attacked
in any of the previous years $1, \dots, k-1$. The
probability, $p_{ik}$, of tree $i$ being attacked in year $k$
conditional on the status of all other trees in the
stand, will be modeled by

$$
\begin{aligned}
p_{ik} &= \Pr[\,Y_{ik} = 1 \mid y_{jk};\ j \neq i\,] \\
       &= \Pr[\,Y_{ik} = 1 \mid k,\ dbh_i,\ vig_i,\ D_{ik}\,] \\
       &= \frac{e^{\eta_{ik}}}{1 + e^{\eta_{ik}}}, \qquad (2.1)
\end{aligned}
$$

with

$$
\eta_{ik} = \alpha_k + \beta_1 D_{ik} + \beta_2 \log(dbh_i) + \beta_3\, vig_i \qquad (2.2)
$$

=  (2.3) 


Fig. 1. Stand A after 1980 attack. + = trees
attacked that year, o = trees with dbh > 23 cm.


Fig. 2. Stand B after 1984 attack. + = trees
attacked that year, o = trees with dbh > 23 cm.
where $dbh_i$ = diameter at breast height, $vig_i$ = vigor
of tree as measured by the amount of stemwood
produced per square meter of crown leaf area per
year (Waring et al. 1980), $d_{ij}$ = distance between
trees $i$ and $j$ and $\theta = \{\alpha_k, \beta_1, \beta_2, \beta_3;\ k = 1, \dots, K\}$
is a set of unknown parameters. The variate $D_{ik}$
could be viewed as a measure of the density of
attacked trees surrounding the tree. $D_{ik}$ is large
if tree $i$ is near other attacked trees.

The conditional probability model in (2.1)-(2.3)
is an auto-logistic model (Besag 1974) with the logit
line

$$
\eta_i(y) = \alpha_i + \sum_{j \neq i} \beta_{ij}\, y_j
$$

where $\beta_{ij} = \beta_{ji} = \beta_1 d_{ij}^{-2}$ and
$\alpha_i = \alpha + \beta_2 \log(dbh_i) + \beta_3\, vig_i$. Auto-logistic models satisfy
the constraints in the Hammersley-Clifford theorem
(Besag 1974) which guarantees that the conditional
probabilities defined above are consistent with a
joint probability distribution.

3  POINT  ESTIMATION 

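Given the above generalized linear model, a
maximum pseudolikelihood estimator (MPE) for
the unknown parameter vector $\theta$ will be defined as
the vector $\hat\theta$ which maximizes the pseudolikelihood
function

$$
\prod_{k=1}^{K} \prod_{i=1}^{n_k} \Pr[\,y_{ik} \mid y_{jk};\ j \neq i\,]
= \prod_{k=1}^{K} \prod_{i=1}^{n_k} p_{ik}^{\,y_{ik}} (1 - p_{ik})^{1 - y_{ik}} \qquad (3.1)
$$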

Pseudolikelihood methods were first proposed by
Besag (1975) for estimation of parameters in a
general Markov random field context. Strauss
and Ikeda (1990) showed that for a logit model
similar to (2.1), maximization of (3.1) is equivalent
to a maximum likelihood fit of a logit regression
model of the form in (2.1) with independent
observations $y_{ik}$. Consequently, estimates can
be obtained using an iteratively reweighted least
squares procedure. Any standard logistic regression
routine can therefore be used to obtain MPE’s of
the parameters. However, the standard errors of the
estimated parameters calculated by the standard
programs are not directly applicable because they
are based on the assumption of independence of the
observations.
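The equivalence with an ordinary logit fit can be sketched in a few lines. The following is a minimal illustration, not the GLIM procedure used in the study: it assumes a single year, a neighbour covariate of the form D_i = Σ_{j≠i} y_j / d_ij² (the inverse-square power is an assumption), and synthetic tree positions and diameters. The names `attack_density` and `fit_logit` are hypothetical.

```python
import numpy as np

def attack_density(xy, y, power=2.0):
    """D_i = sum over j != i of y_j / d_ij**power (power=2 is an assumption)."""
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a tree is not its own neighbour
    return (y[None, :] / d ** power).sum(axis=1)

def fit_logit(X, y, n_iter=50, tol=1e-9):
    """Newton / IRLS maximization of the Bernoulli (pseudo)likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        score = X.T @ (y - p)                          # gradient of log PL
        info = X.T @ (X * (p * (1.0 - p))[:, None])    # X' W X
        step = np.linalg.solve(info, score)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Synthetic stand: 15 x 15 jittered grid with 5 m spacing (illustrative only).
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(15) * 5.0, np.arange(15) * 5.0)
xy = np.column_stack([gx.ravel(), gy.ravel()]) + rng.uniform(-1, 1, (225, 2))
dbh = rng.uniform(8.0, 35.0, 225)
p_true = 1.0 / (1.0 + np.exp(-(-6.0 + 2.0 * np.log(dbh))))
y = (rng.random(225) < p_true).astype(float)

# Design matrix: intercept, neighbour density D, log(dbh).
X = np.column_stack([np.ones(225), attack_density(xy, y), np.log(dbh)])
theta_hat = fit_logit(X, y)
print(theta_hat)
```

The point estimates from such a fit are the MPE's; as the text notes, the standard errors that a logistic routine would report alongside them rest on independence and should not be used directly.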


4  STANDARD  ERROR  ESTIMATION 

In this section standard errors of MPE’s are estimated
using a parametric bootstrap procedure
(Efron 1982, 1990). An iterative sampling scheme is
used to simulate samples from the joint distribution
$\pi = \Pr[Y_1 = y_1, \dots, Y_n = y_n]$ given the conditional
probabilities $p_i = \Pr[Y_i = y_i \mid (y_j;\ j \neq i)]$,
for $i = 1, \dots, n$. The sampling scheme is as follows:
Starting with an arbitrary set of initial values
$(y_1^{(0)}, \dots, y_n^{(0)})$, generate a new value $y_1^{(1)}$ from the
distribution of $y_1 \mid y_2^{(0)}, y_3^{(0)}, \dots, y_n^{(0)}$; next, generate
$y_2^{(1)}$ from the distribution of $y_2 \mid y_1^{(1)}, y_3^{(0)}, \dots, y_n^{(0)}$
and so on, up to $y_n^{(1)}$ from the distribution of
$y_n \mid y_1^{(1)}, y_2^{(1)}, \dots, y_{n-1}^{(1)}$. This is a Markov chain
sampling scheme with transition probabilities given
by the conditional probabilities $p_i$. In this scheme
only one variable is changed in each transition
and after $n$ transitions we arrive at the sample
$(y_1^{(1)}, \dots, y_n^{(1)})$. Hastings (1970) showed that
if the matrix $P$ of transition probabilities is
reversible and irreducible then $\pi$ is the unique
stationary distribution of the Markov process $P$.
For the present data, the only states with positive
transition probabilities are those of the form
$s_0 = \{Y_i = 0,\ Y^{-i} \in s^{-i}\}$ and $s_1 = \{Y_i = 1,\ Y^{-i} \in s^{-i}\}$,
where $Y^{-i} = \{Y_j;\ j \neq i\}$ and $s^{-i}$ is the
state space (or possible outcome) of the vector
$Y^{-i}$. Therefore,

$$
\begin{aligned}
\pi_{s_0} P_{s_0, s_1}
 &= \Pr[Y_i = 0,\ Y^{-i} \in s^{-i}]\ \Pr[Y_i = 1 \mid Y^{-i} \in s^{-i}] \\
 &= \Pr[Y_i = 0 \mid Y^{-i} \in s^{-i}]\ \Pr[Y^{-i} \in s^{-i}]
    \times \Pr[Y_i = 1 \mid Y^{-i} \in s^{-i}] \\
 &= P_{s_1, s_0}\, \pi_{s_1}
\end{aligned}
$$

In  other  words,  the  Markov  chain  is  reversible. 
The  chain  is  also  irreducible  because  in  any  given 
stand  the  distance  between  any  two  trees  is  finite 
and,  therefore  all  the  conditional  probabilities  are 
nonzero. 

Geman and Geman (1984) called this sampling
scheme the ‘Gibbs sampler’ and developed some
general results about the convergence and rate of
convergence of the joint density of $(y_1^{(t)}, \dots, y_n^{(t)})$
to the true joint density of $(y_1, \dots, y_n)$. For
the present problem, $t$ iterations of the above
sampling scheme replicated $M$ times will produce
$M$ independent, identically distributed samples
$(y_{1m}^{(t)}, \dots, y_{nm}^{(t)};\ m = 1, \dots, M)$ from the distribution
$\pi_t$ that has $\pi$ as its stationary distribution.
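The sampling scheme can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' program: the neighbour weights are taken to be d_ij⁻² (an assumption), the tree positions and parameter values are synthetic, and `gibbs_sweeps` is a hypothetical name. One sweep visits every tree once and redraws its status from its conditional probability given the current status of all other trees.

```python
import numpy as np

def gibbs_sweeps(alpha, beta1, w, y0, n_sweeps, rng):
    """Run n_sweeps full Gibbs sweeps; w[i, j] = d_ij**-2 with zero diagonal."""
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(n_sweeps):
        for i in range(y.size):
            eta = alpha[i] + beta1 * (w[i] @ y)   # w[i, i] = 0 excludes y_i
            p = 1.0 / (1.0 + np.exp(-eta))        # Pr[Y_i = 1 | all other trees]
            y[i] = float(rng.random() < p)
    return y

rng = np.random.default_rng(1)
n = 60
xy = rng.uniform(0.0, 50.0, (n, 2))               # synthetic tree positions
d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
w = d ** -2.0                                     # assumed inverse-square weights

alpha = np.full(n, -2.0)                          # tree-level intercepts alpha_i
y0 = (rng.random(n) < 0.1).astype(float)          # initial values, spatial independence

# M = 50 replicates, each the state after t = 20 sweeps from the same start.
samples = np.array([gibbs_sweeps(alpha, 5.0, w, y0, 20, rng) for _ in range(50)])
print(samples.mean())                             # average simulated attack rate
```

Refitting the logit model to each replicate and taking the standard deviation of the resulting estimates gives the parametric bootstrap standard errors described in this section.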

Figures 3-6 are plots of the MPE’s of the
parameters calculated after each iteration using the
spatial locations of trees in stand A (see Fig. 1).
Each iteration involved the generation of n = 576
random variates (where n = number of trees in stand
A). Initial values were generated assuming spatial
independence (i.e., assuming a logit model with
$\beta_1 = 0$). Results of the simulations seem to indicate
that the MPE’s are unbiased. The values seem to
oscillate around the actual parameter values used to
generate the data. Also, the rate of convergence was
very fast. The rate of convergence of the sampling
scheme did not appear to depend on the initial
values. For example, use of $p_i = p$ $(i = 1, \dots, n)$
to generate initial values gave the same results (i.e.,
convergence within a few iterations) as the more
informative initial values used above.

5  RESULTS 

Data  from  the  first  three  years  in  stand  A  and 
the  fourth  year  in  stand  B  were  used  to  calculate 
two  sets  of  estimates  (one  for  each  stand)  of  the 
parameters  in  (2.2).  Data  from  the  remaining 
years  were  not  included  in  the  analysis  because 
the  numbers  of  attacked  trees  were  either  zero  or 
small  (<  10).  Table  1  lists  the  values  of  the  MPE’s 
and  two  estimates  of  their  standard  errors.  MPE’s 
were calculated using the GLIM statistical package
with the logit link and binomial error options. The
standard errors from simulations are the standard
deviations of $\hat\theta_m$ $(m = 1, \dots, 200)$, from 200
simulations  using  the  sampling  scheme  described 
above.  For  each  simulation,  the  variates  generated 
after  20  iterations  were  used  to  fit  the  logit  model 
in  (2.1)-(2.3). 

In both stands A and B the covariates dbh and
D seem to have significant effects on the conditional
probabilities. Figures 7-8 are contour plots of
the estimated conditional probabilities versus
$\Delta = 1/\sqrt{D}$. For a given tree, its distance measure, $\Delta$, is
small if the tree is close to other attacked trees. The
contour plots seem to indicate that the probability
of a small tree (dbh less than 15 cm) being attacked
is small unless it is close to other attacked trees.
However, large trees seem to be attacked even when
their distance measure, $\Delta$, is large, i.e., even when
there are no attacked trees nearby.

Further studies that are in progress include assessing
the goodness-of-fit of the auto-logistic model
and the use of alternative measures of distance
with, perhaps, more biologically meaningful
interpretations.
Table 1. Pseudolikelihood estimates of
parameters and standard errors produced
by fitting the auto-logistic model to
stand A 1981 and stand B 1984 data.

                                  Standard Error from
Stand   Parameter   Estimate    Simulations     GLIM
A       α            -20.30        2.252        2.214
A       β1            17.66        2.018        2.264
A       β2             6.24        0.787        0.753
A       β3            -0.003       0.007        0.009
B       α            -12.60        1.313        1.410
B       β1             3.06        1.124        1.008
B       β2             4.54        0.443        0.484
B       β3            -0.0013      0.008        0.007

Fig. 3-6. Parameter estimates from simulations.

Fig. 7-8. Estimated conditional probabilities of attack
(curves for DBH = 10 cm, 15 cm, 20 cm and 30 cm).
Abstract

A major source of problems in poorly fitting environmental
protection equipment is the lack of proper consideration
of the possible variation among aircrew facial
feature sets. The design of flight equipment is currently
based on rather outdated mechanical anthropometric
measuring techniques with limited justification
for specific size and shape characteristics. This paper
discusses the preliminary attempt to statistically capture
this variability in an organized manner. Three-dimensional
data collected from a laser scan of 200 subjects
were statistically summarized for presentation and
analysis. However, the techniques applied to the problem
go beyond the classical Fourier or trend surface
methods. Statistical methods traditionally reserved for
geology were applied allowing full consideration of the
correlation structure of the facial area. An “average”
face along with upper and lower percentiles were then
available as input to computer-aided design programs.


1  Introduction 

The objective of the study was to develop techniques
for statistically analyzing anthropometric data so that
physical models could be developed to support the design
of flight equipment such as oxygen masks and limited
visibility goggles.

The statistical methods used in this study are founded
in the spatial analysis techniques associated with kriging.
Because relatively little has been published
on the applications of kriging outside the geostatistical
community, a brief introduction to this technique is provided.


*This study was sponsored by the Human Engineering Division
of the Harry G. Armstrong Aerospace Medical Research Laboratory,
WPAFB, OH.


2  Kriging 

Typically, the kriging process consists of two phases:
structural analysis to determine the spatial distribution
of the variables, and estimation using a best linear unbiased
estimator. In the first phase, the variogram is used
to quantify the structural information and is defined as:

$$
\gamma(h) = \tfrac{1}{2}\,\mathrm{Var}[\,F(x+h) - F(x)\,]
$$

where $F(x)$ describes a random function over the support
$x, h \in \mathbb{R}^n$. In practice, a model is fit to the
experimental variogram using weighted least squares or
graphical techniques. One commonly used theoretical
variogram is the spherical model which is defined as follows:

$$
\gamma(h) =
\begin{cases}
C_0 + C\left[\dfrac{3h}{2a} - \dfrac{h^3}{2a^3}\right], & 0 < h \le a \\[1ex]
C + C_0, & h > a \\[1ex]
0, & h = 0
\end{cases}
$$
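The spherical model is straightforward to evaluate numerically. The sketch below is illustrative only; the function name `spherical_variogram` is hypothetical, and the default parameter values are the ones reported for the facial data in Section 3.2 (C = 2.226, C0 = 0.689, a = 6.645).

```python
import numpy as np

def spherical_variogram(h, c=2.226, c0=0.689, a=6.645):
    """Spherical variogram: nugget c0, partial sill c, range a; gamma(0) = 0."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)                  # beyond the range the curve is flat
    gamma = c0 + c * (1.5 * r - 0.5 * r ** 3)
    return np.where(h == 0.0, 0.0, gamma)       # value at zero lag is defined as 0

lags = np.array([0.0, 1.0, 3.0, 6.645, 10.0])
print(spherical_variogram(lags))
```

The curve jumps from 0 to the nugget C0 just above zero lag, rises smoothly, and levels off at the sill C + C0 once the lag reaches the range a.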

For estimation, we desire an unbiased, linear estimate
$\hat F(x)$ that has minimum expected estimation error. The
estimate $\hat F(x)$ of $F(x)$ is assumed to be a linear estimator
involving $N$ observations in the neighborhood of
$F(x)$:

$$
\hat F(x) = \sum_{i=1}^{N} \lambda_i F_i(x)
$$

where the weights $\lambda_i$ are chosen so that the estimate
is unbiased, $E[\hat F(x) - F(x)] = 0$ (i.e. $\sum_i \lambda_i = 1$), and the
estimation error variance

$$
\sigma_E^2 = \mathrm{Var}[\,\hat F(x) - F(x)\,]
$$

is minimized. In terms of the variogram the estimation
variance is given by:

$$
\sigma_E^2 = 2 \sum_i \lambda_i\, \gamma(h_{iF}) - \sum_i \sum_j \lambda_i \lambda_j\, \gamma(h_{ij})
$$

Minimizing this variance subject to the unbiased constraint
results in the following set of linear equations:

$$
\sum_j \lambda_j\, \gamma(h_{ij}) + \mu = \gamma(h_{iF}), \qquad \sum_j \lambda_j = 1
$$

where $i = 1, \dots, n$, $\mu$ is a Lagrange multiplier, $h_{ij}$ is the
vector distance between observations $F_i(x)$ and $F_j(x)$,
and $h_{iF}$ is the vector distance between $F_i(x)$ and the
point to be estimated $F(x)$. This form of the kriging
equation is generally referred to as punctual kriging.

The solutions to these equations, $\lambda_j^*$ and $\mu$, are then
used to make point estimates of the surface at $x$:

$$
\hat F(x) = \sum_j \lambda_j^* F_j(x)
$$

along with estimates of the prediction error:

$$
\sigma_E^2 = \mu + \sum_j \lambda_j^*\, \gamma(h_{jF})
$$

For a more detailed discussion on kriging, the reader
is referred to the geostatistical literature (e.g. Cressie
(1989), David (1977) or Journel and Huijbregts (1989)).
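The punctual kriging system above is a small linear solve. The following sketch assembles and solves it for one target point, assuming an isotropic spherical variogram with the parameter values reported in Section 3.2; the function names and the five sample points are hypothetical and chosen only for illustration.

```python
import numpy as np

def spherical(h, c=2.226, c0=0.689, a=6.645):
    """Spherical variogram with gamma(0) = 0."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    return np.where(h == 0.0, 0.0, c0 + c * (1.5 * r - 0.5 * r ** 3))

def krige_point(xy, f, x0, gamma=spherical):
    """Solve sum_j lam_j*gamma(h_ij) + mu = gamma(h_iF), sum_j lam_j = 1."""
    n = len(f)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)                  # variogram between observations
    A[n, n] = 0.0                         # corner entry of the bordered system
    g0 = gamma(np.linalg.norm(xy - x0, axis=1))
    sol = np.linalg.solve(A, np.append(g0, 1.0))
    lam, mu = sol[:n], sol[n]
    estimate = lam @ f                    # point estimate of the surface
    variance = mu + lam @ g0              # prediction error variance
    return estimate, variance, lam

xy = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [5.0, 5.0], [2.0, 6.0]])
f = np.array([1.0, 2.0, 0.5, 3.0, 1.5])  # residual depths (made-up values)
est, var, lam = krige_point(xy, f, np.array([2.0, 2.0]))
print(est, var, lam.sum())
```

Because γ(0) is defined as 0, the estimate reproduces an observation exactly when the target coincides with a data location, and the prediction variance there is zero.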

3  Problem  Solution 

The end product of this study was a prototype for eye-protection
gear. The following discussion outlines the
problem solution and describes data collection and analysis,
structural analysis, and spatial estimation.

3.1  Data  Collection  and  Analysis 

Personnel from the Armstrong Aerospace Medical Research
Lab collected the data using a Cyberware Echo
digitizer. The laser scanner is mounted on an arm which
rotates around the head of the seated subject, providing
measurements of 131072 points over the entire surface
of the head (512 locations on the plane of rotation, and
256 locations on the vertical plane). The corresponding
third coordinates are determined by measuring the
depth at these points using a triangulation procedure
based on the scanner’s reference point.

Prior to analysis of the spatial properties it was necessary
to orient the subjects relative to a common axis
system. The data for the region surrounding the eyes
was established by truncating the data sets to the points
within the glabella, pronasale, and left and right tragions.
The subjects were aligned using a multivariate
optimization routine for minimizing the squared Euclidean
distance between these four landmarks and four
predetermined external reference points.
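The alignment step can be illustrated with a closed-form least-squares fit. The sketch below uses the Kabsch (orthogonal Procrustes) solution for a rigid rotation and translation, which is a substitute for the multivariate optimization routine mentioned above and not necessarily what the authors used. The landmark coordinates and the function name `rigid_align` are hypothetical.

```python
import numpy as np

def rigid_align(landmarks, reference):
    """Rotation R and translation t minimizing sum ||R p_i + t - q_i||^2."""
    p_mean = landmarks.mean(axis=0)
    q_mean = reference.mean(axis=0)
    H = (landmarks - p_mean).T @ (reference - q_mean)   # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    sign = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Hypothetical reference positions (cm): glabella, pronasale, left/right tragion.
reference = np.array([[0.0, 9.0, 3.0],
                      [0.0, 11.0, -1.5],
                      [-7.5, 0.0, 0.0],
                      [7.5, 0.0, 0.0]])

# A "scanned" subject: the same landmarks in a rotated, shifted frame.
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
scanned = (reference - t_true) @ R_true

R, t = rigid_align(scanned, reference)
aligned = scanned @ R.T + t
print(np.abs(aligned - reference).max())
```

The same R and t would then be applied to every scanned point of the subject so that all data sets share the common axis system.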


A total of 200 subjects were available for analysis and
all associated data sets were aligned to a common axis
system. From this population of 200, a random sample
of 35 subjects was chosen for structural analysis.

3.2  Structural  Analysis 

An artificial grid of 50 by 100 was established and superimposed
upon each of the 35 data sets. Initial estimates
of the global trend were determined using a simple nearest
neighbor method for a selected grid structure. After
removal of this initial trend from a subject, the residuals
were analyzed to determine the nature and extent
of the correlation structure. A spherical variogram with
parameters C = 2.226, C0 = 0.689, and a = 6.645 was
found to best describe the spatial correlation for the
region of the face of interest in this study. Figure 1
displays the theoretical variogram overlaid on the experimental
variograms (for four directions) for a typical
subject.


Figure  1.  Variograms  for  Subject  160 

3.3  Individual  Spatial  Estimation 

An artificial grid of 50 x 100 points was superimposed on
the residual data sets and the kriging equations established
for each of the 5000 points. The resulting estimate,
when added to the global trend, provides an estimate of
the function describing the subject face as well as an
estimate of the variance of the surface estimate at any
point. Figure 2 depicts the results of kriging one subject.

It was assumed that the variogram and the global
trend surface based on the 35 sampled subjects represented
the population and would not change. Using the original
variogram and trend surface, an estimate for each subject
was determined individually and then aggregated
sequentially using a recursive relationship to update the
population mean and variance.




4  Results 

This  scheme  was  applied  to  all  200  subjects.  Figure 
3  represents  the  final  surface  estimate  for  the  limited 
visibility  goggles  developed  with  this  procedure. 


Figure  2.  Kriged  Surface  for  Subject  9 


3.4  Population  Spatial  Estimation 

If the subjects are assumed to be uncorrelated, the
population mean and variance at a particular
grid location for the $k$th subject can be estimated
by:


with: 


=  Pkf>k 


1 


t=i 


$$
w_i = \frac{1}{1 + \sigma_i}
$$


where $F_i$ and $\sigma_i$ are the kriging mean and variance,
respectively, of the $i$th subject at the grid location of
interest. After a bit of algebra and by slightly modifying
the algorithm, the following relationships result:


Pk  =  Pk-\  +  [A  -  Pk-\] 


<^k  =<^k-i-  PkWl-i  -  {Fk  - 
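A sequential update of this kind can be sketched as follows. This is a minimal illustration under assumptions, not the authors' exact recursion: it takes the population mean to be the w-weighted average of the kriged values with w_i = 1/(1 + σ_i), and it uses the standard weighted running-variance update for the spread between subjects. The class name `PopulationSurface` and the synthetic inputs are hypothetical.

```python
import numpy as np

class PopulationSurface:
    """Running weighted mean and variance at every grid location."""

    def __init__(self, shape):
        self.weight_sum = np.zeros(shape)
        self.mean = np.zeros(shape)
        self.s = np.zeros(shape)          # running sum of w * squared deviations

    def update(self, f_k, var_k):
        """Fold in subject k: kriged surface f_k and kriging variance var_k."""
        w = 1.0 / (1.0 + var_k)           # weight w_k = 1 / (1 + sigma_k)
        self.weight_sum += w
        delta = f_k - self.mean           # deviation from the previous mean
        self.mean += (w / self.weight_sum) * delta
        self.s += w * delta * (f_k - self.mean)

    @property
    def variance(self):
        return self.s / self.weight_sum

# Three synthetic "subjects" on the 50 x 100 grid used in the study.
rng = np.random.default_rng(3)
population = PopulationSurface((50, 100))
for _ in range(3):
    kriged = rng.normal(loc=10.0, scale=1.0, size=(50, 100))
    kriging_var = rng.uniform(0.1, 2.0, size=(50, 100))
    population.update(kriged, kriging_var)
print(population.mean.shape, float(population.variance.mean()))
```

Only the current mean, the accumulated weight and one running sum are stored per grid point, so subjects can be processed one at a time without keeping all 200 surfaces in memory.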


Figure  3.  Night  Vision  Goggles  Surface  Estimate 

Following the estimate of the upper facial surface,
the data was reformatted for input to a numerically
controlled milling machine and a physical model constructed.
This model has since been used to develop and
evaluate alternative night-vision goggle support structures.

Engineers now have the statistical methods to support
the design of flight apparatus which accounts for
the shape of facial features. In addition, research is almost
complete on extending the methods to include not
only analysis of surface features (2-dimensional data),
but also analysis of 3-dimensional data from magnetic
resonance image data.
Abstract

Most engineering application programs are designed to
model, analyze, design, or monitor complex systems. Such
systems can often be represented by schematic diagrams.
Hierarchical modeling is a method by which complicated
schematic diagrams can be expressed in a more easily
comprehended form. This paper describes a software “layer” for
editing hierarchical schematic diagrams that can be used as a
generic interface into many application programs.

Introduction 

Schematic diagrams are a graphical method for representing
complex systems. These diagrams consist of icons
representing system elements and connectors representing a
logical association of the elements. Examples of the widespread
use of schematic diagrams in modeling are in control
system analysis, simulation, flowcharting, and communications
(Hammond et al.[1989], Ozden[1991], Sargent[1986],
Stanwood et al.[1986]). A schematic diagram is a particularly
powerful tool as a user interface to any set of programs that
analyze, model, monitor, or control these systems.

One of the key advantages of schematic diagram based
interfaces is that many tasks of model validation can be performed
within the graphical interface. The user simply does
not make many mistakes (such as mis-connections) that are
easily missed in other formats. Some application-specific
error checking can proceed and trap errors before analysis
commences.

This paper describes one such interface created initially
for EASY5, a controls analysis and non-linear simulation
program written by Boeing Computer Services. An important
aspect of the interface is that it was written to be independent
of and isolated from the underlying program or application. It
was therefore possible to include features generally desirable
in many applications of schematic diagrams; the interface is
currently being used in at least two quite different areas.

Hierarchical  Structure 

A natural simplification of the schematic diagram results
when collections of elements that perform related functions
or operations are grouped into a single new meta-element.
The meta-element contains the sub-graph of related elements,
but appears as a single icon. These meta-elements then represent
another “level” in the schematic structure, and convert
the two-dimensional graph to a tree-structured acyclic graph;
acyclic in the sense that data flow between the nodes or levels
is directional. Each level contains a sub-graph. Connectors
still actually terminate at elements, but appear to terminate at
the meta-element that represents the next level down in the
tree. If the contents of the meta-element are displayed, the
isolated piece of the schematic is visible; connectors to elements
in the “parent” schematic appear to run to the edge of
the window. The entire schematic becomes a hierarchy of
meta-elements.
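The relationship between true connector endpoints and what is drawn at each level can be sketched with a small tree structure. The classes and function names below (`Element`, `MetaElement`, `visible_endpoint`, `visible_connectors`) are hypothetical illustrations and are not taken from the interface described in this paper.

```python
class Element:
    """A leaf icon in the schematic."""
    def __init__(self, name):
        self.name = name
        self.parent = None

class MetaElement(Element):
    """An icon that contains a sub-graph of elements and meta-elements."""
    def __init__(self, name):
        super().__init__(name)
        self.children = []

    def add(self, node):
        node.parent = self
        self.children.append(node)
        return node

def visible_endpoint(element, level):
    """Icon at which a connector to `element` appears to end when viewing `level`.

    Returns None if the element lies outside `level` (an off-page connector).
    """
    node = element
    while node is not None and node.parent is not level:
        node = node.parent
    return node

def visible_connectors(level, connectors):
    """Connectors as drawn at `level`; links hidden inside one icon are skipped."""
    shown = []
    for src, dst in connectors:
        a, b = visible_endpoint(src, level), visible_endpoint(dst, level)
        if a is b:
            continue
        shown.append((a.name if a else "off-page", b.name if b else "off-page"))
    return shown

root = MetaElement("system")
sensor = root.add(Element("sensor"))
actuator = root.add(MetaElement("actuator"))
servo = actuator.add(Element("servo"))
valve = actuator.add(Element("valve"))
connectors = [(sensor, servo), (servo, valve)]   # connectors join true elements

print(visible_connectors(root, connectors))      # [('sensor', 'actuator')]
print(visible_connectors(actuator, connectors))  # [('off-page', 'servo'), ('servo', 'valve')]
```

The connector list itself never changes when elements are grouped; only the mapping from true endpoints to displayed icons depends on the level being viewed.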

Meta-elements can provide a dramatic simplification of
complex schematics. Fig. 1 shows an EASY5 schematic
model of a complex aircraft control system before (a) and
after (b) grouping elements into meta-elements, and the
appearance of the sub-graph within one meta-element
formed (c). The schematic structure can become quite complex
without the user losing an intuitive grasp of the system.

The above description specifies a graphical process to
form a meta-element. No changes to the represented system
structure are involved. Meta-elements may also be created in
which the functionality of the individual elements is completely
isolated from the rest of the schematic. Access to the
variables and parameters of the meta-element is provided
through a specified subset of all possible connectable inputs
and outputs. In this way a grouping of elements becomes an
entity similar to a true element, and is referred to as a
parameterized meta-element.

Fig. 1(a). Schematic diagram of complex control system,
before forming meta-elements.

Fig. 1(b). The same schematic diagram but simplified with
the formation of meta-elements.

Fig. 1(c). The result of an examine operation on one of the
meta-elements formed. Note the off-page connectors.

Both types of meta-elements have a certain utility in
simplification. We chose to implement graphical meta-elements
only, as they are easier to understand, create, and manipulate.

Meta-elements are an aid to configuration control, as
they can be stored in libraries and included as “plug-in” or
reusable subsystems in any system. Testing can proceed on
new enhancements to a subsystem while other users of the
same meta-element can rely on the older versions; when the
testing is complete, all schematics can access the improved
version.

Programming  Concepts 

Schematic diagrams, by their nature, lend themselves to
an object-oriented treatment. Each object in the diagram contains
common attributes such as appearance-related information
(size, color, position), current state (“selected” or other
application-related states), and a unique identifier.
Application-defined data is stored in “pigeonholes”, or pointers to
data that the schematic program passes without processing. It
is this latter aspect that allows the separation of application
end use of the schematic and the schematic interface itself.

Data contained within the pigeonholes can be observed
with data viewers. These pieces of the interface are application
specific and allow a user to “examine” elements of the
schematic diagram for content and change that content, as
well as observing and editing aspects of the data flow within
the system. For instance, EASY5 elements have defined signal
inputs and outputs. If the inputs are not connected or
filled by variables from other elements, they become static
parameters of the system. Fig. 2 shows a data viewer for an
element that contains editable FORTRAN code and definable
inputs and outputs.

A schematic diagram is a representation of a system.
The topology of the system refers to the mathematical structure
represented by the graph itself, and not to the appearance
of the schematic diagram. This implies that operations
that simply affect the appearance of the schematic, such as
moving elements, grouping elements into meta-elements,
etc. do not require application-specific interaction.

Interface functions are called by the schematic program
whenever modifications are being made to the system topology.
These operations include adding or deleting elements or
connectors. This allows the specific application to approve
or reject the change or inquire for more information.
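The separation between the generic schematic layer and the application can be sketched with a callback. Everything in this example is a hypothetical illustration (the class `Schematic`, the callback signature, and the "sink" rule); it is not the interface's actual programming model. Topology changes are referred to the application for approval, appearance changes are not, and the pigeonhole is stored without being inspected.

```python
class Schematic:
    """Generic schematic layer; the application supplies approve_change."""

    def __init__(self, approve_change):
        self.approve_change = approve_change
        self.elements = {}        # id -> {"kind", "position", "pigeonhole"}
        self.connectors = set()   # (source id, destination id)
        self._next_id = 0

    def add_element(self, kind, position, pigeonhole=None):
        if not self.approve_change("add_element", kind=kind):
            return None
        self._next_id += 1
        self.elements[self._next_id] = {
            "kind": kind, "position": position, "pigeonhole": pigeonhole}
        return self._next_id

    def move_element(self, element_id, position):
        # Appearance only: the topology is unchanged, so no callback is made.
        self.elements[element_id]["position"] = position

    def connect(self, src, dst):
        ok = self.approve_change("add_connector",
                                 src=self.elements[src], dst=self.elements[dst])
        if ok:
            self.connectors.add((src, dst))
        return ok

def control_app_rules(operation, **info):
    """Hypothetical application rule: a 'sink' element has no outputs."""
    if operation == "add_connector":
        return info["src"]["kind"] != "sink"
    return True

sch = Schematic(control_app_rules)
gain = sch.add_element("gain", (10, 10), pigeonhole={"k": 2.5})
scope = sch.add_element("sink", (40, 10))
sch.move_element(gain, (12, 10))
print(sch.connect(gain, scope))   # True: accepted by the application
print(sch.connect(scope, gain))   # False: rejected, a sink has no outputs
```

A different application would supply its own approval function and pigeonhole contents while reusing the same schematic layer unchanged.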

Information not directly related to or contained within
specific elements or connectors may be added to the schematic
as drawings, i.e., collections of graphics and text information
not required by the system but helpful for
documentation.

Elements and drawings store their appearance as vector
or other scalable graphic information, editable by the user.
Bitmap icons, while sometimes superior in appearance, are
difficult to scale.

General  Appearance  of  the  Interface 

The interface appears as a window surrounded by some
control panels and menu bars. In the large central window, the
user may add elements (icons) from a menu of available functions
and position them in the window. The appearance of
individual elements may be changed with a graphics editor,
and text or other symbols or drawings may be added to the
schematic. Selecting an element (moving a cursor over it and
clicking a mouse button) identifies it for subsequent operations
such as moving its location, deletion, copying, or
attaching a connector to another element. Clicking another
button on the mouse performs an examine operation which
gives the user some feedback as to the contents or significance
of the examined object (generally, a data viewer is
invoked). The window may be “panned” over the schematic
to view different portions of it, and the magnification of the
view may be changed. Application-specific controls in the
menus perform analysis tasks. Schematics may be saved and
read in, with previous versions automatically numbered.
Schematic diagrams generated by completely different applications
can still be viewed and graphically edited in any program
using the interface.

Fig. 2. EASY5 data viewer during an examine operation
on a FORTRAN element.

Specific  Requirements 

Through much experience with engineering use of schematic
diagram interfaces, certain requirements for their construction
have been established. These include features that
make the interface more intuitive and easy to use as well as
performing some of the laborious engineering tasks of modeling:

(1)  The  schematic  must  allow  scaling  operations  such  as 
zoom  and  pan  so  that  the  diagram  may  be  viewed  in  its 
entirety  or  enlarged  to  study  a  specific  portion. 

(2) Connectors must initially be routed automatically.
This removes the onerous task of cleaning up connectors
whenever the position of elements is changed in the schematic.
There are many rule-based algorithms for connector
routing available (see Heller et al.[1982] or Lee[1961]).
Automatic routing here implies rectilinear routing that avoids
obstacles such as elements.

(3) Creation and destruction (the inverse operation) of
meta-elements should be very transparent to the user.

(4) Navigation or view control between levels of the
schematic (i.e., between meta-elements) should be quite facile.
By executing an examine operation on any meta-element,
the user should then view the level within that meta-element.
Navigation is facilitated by a dynamic list of available
meta-elements, and labeling that indicates the current level being
viewed.

(5)  Hardcopy  must  be  presentable  as  a  final  document 
and  contain  as  much  information  as  possible  without  being 
cluttered. 
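Requirement (2) can be illustrated with a breadth-first grid router in the style of the maze-routing algorithm cited as Lee[1961]. This is a minimal sketch and makes no claim about the router used in the interface; the grid encoding ('#' marks cells covered by elements) and the function name `lee_route` are assumptions made for the example.

```python
from collections import deque

def lee_route(grid, start, goal):
    """Shortest rectilinear path on a grid; '#' cells are obstacles.

    grid is a list of equal-length strings; start and goal are (row, col).
    Returns the list of cells from start to goal, or None if no route exists.
    """
    rows, cols = len(grid), len(grid[0])
    previous = {start: None}              # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:       # trace the wave back to the start
                path.append(cell)
                cell = previous[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and nxt not in previous):
                previous[nxt] = cell
                queue.append(nxt)
    return None

layout = ["......",
          "..##..",
          "..##..",
          "......"]
route = lee_route(layout, (1, 0), (1, 5))
print(route)
```

Breadth-first expansion guarantees a shortest rectilinear route around the obstacles, but it does not minimize the number of bends; a production router would typically add a bend penalty or a clean-up pass.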

Conclusions 

A schematic interface is an excellent enhancement to
many existing computer programs that are quite mature in
their computational ability but require detailed and complex
input from the user. It is also a useful paradigm for building
an analysis program from scratch. The interface we built is
being used for both cases. In the case of EASY5, more than 6
years of experience with a commercial product that has
included an increasingly sophisticated schematic front end
has demonstrated vastly increased productivity for the engineers
that use it.
Abstract

Two major developments in statistical computing in the
eighties were array languages and object-oriented
programming. These developments have been realized only
in fragmentary form until now. M++ is a collection of C++ classes,
methods, and functions for array manipulation, linear
algebra, eigensystem analysis, matrix factorization, and
general numeric and statistical computation. M++ extends
C++, creating a powerful, object-oriented array language
with direct access to all of the features of C and C++.


1  Introduction 

Scientific and statistical computing splintered in the
eighties into two parts: one using the new array language
interpreters such as S, GAUSS, MATLAB, and
MATHEMATICA, and another remaining with
FORTRAN. Both of these have ignored, to a large extent,
the development of the C programming language, which
seems to have captured a major part of the rest of the
computing world.

There  have  been  tentative  movements  toward  C  from 
FORTRAN,  but  the  obstacles  are  formidable.  There  is  a 
large  investment  in  FORTRAN  code  that  cannot  be  ignored, 
nor  are  C  arrays  really  convenient  for  matrix  computation. 
C  also  seems  to  require  a  significant  amount  of  training; 
acquiring  FORTRAN  is  a  less  intimidating  problem. 

While many researchers have stayed with FORTRAN,
others have been turning to array languages. The ability to
write code manipulating arrays of data with a simple,
intuitive syntax often proves irresistible. Mathematical
expressions translate directly into code and many lines of
FORTRAN and C code are replaced by a few statements.
Productivity is dramatically improved, and this can translate
into the undertaking of more advanced projects than would be
attempted with FORTRAN or C.

All  array  languages  of  any  significance,  however,  are 
interpreters,  have  few  low  level  features,  tend  to  be  weakly 
typed,  and  are  not  extensible.  While  array  languages 
provide  a  convenient  method  for  the  manipulation  of  arrays 
of  data,  their  syntax  may  become  overburdened  when 
applied  to  large,  complicated  tasks. 

Many researchers who have moved to the array languages
are now beginning to encounter their limitations. They may
regret having left FORTRAN’s superior numeric standards
behind, and they may be frustrated with their inability to
develop large, complex applications in the array language
environment. On the horizon, though, is something new, an
array language extension to C++ called M++ (Dyad
Software, 1991), that may alleviate these problems. M++ is
a complete array language extension to C++ containing
multi-dimensional arrays of all of the C++ built-in data
types, along with a full range of statistical and mathematical
operators and functions. C++ is a superset of C, giving it
and M++ access to all of the low level functionality for
which C is well known. Translators and native compilers
exist for C++ on most platforms that support C, which
means nearly all operating systems that exist, and therefore
M++, a non-interpretive, full featured array language with
direct access to a powerful, low-level language, will be
available on just about every computer.

2  C++,  AN  OBJECT-ORIENTED  EXTENSION  OF  C 

C has a number of features not available in FORTRAN:
run-time allocation of memory, more convenient I/O, scope,
and structures, for example. Together they may not add up
to enough advantage for the researcher to move to C. The
features C++ adds to C, however, are substantial and may
prove to be worth the attention of the researcher. The
implementation of the object-oriented concept in C++
provides for user-defined data types with data-hiding,
derivation and inheritance, function and operator
overloading, as well as control over the creation, destruction,
declaration, and assignment of objects (Ellis and Stroustrup,
1990). Other implementations of the object-oriented
paradigm exist, but C++ is particularly suited for the
researcher because of its efficiency and because of the
availability of the built-in numerical types and operators
inherited from C. Most C++ compilers (not to mention C
compilers) have yet, however, to incorporate the IEEE
numeric standards that have always been a part of
FORTRAN. An exception is the ZORTECH C++ "Science
and Engineering" compiler that fully implements the
standards set by the ANSI Numerical C Extensions Group,
based on the IEEE numeric standard (Ladd, 1991).

The researcher typically works with large sets of data, and often
the operations on these data can be described mathematically
in a simple form. A program written in C, however, must
deal with each number in an explicit way. A general
program written to compute some basic statistics, for
example, must first allocate and initialize memory for every
number to be stored. It then reads the numbers in one by
one, and provides instructions in loops to compute the
statistics, number by number. For example, the following C
program reads in data and computes means and standard
deviations:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv)
{
    /* DECLARATION */
    char *fn;
    FILE *filen;
    int numObs, numVars, i, j, n;
    double **data, **mm, *mn, *sd;

    /* COMMAND LINE: file name, number of observations, number of variables */
    if (argc < 4)
        return 1;
    fn = argv[1];
    numObs = atoi(argv[2]);
    numVars = atoi(argv[3]);

    /* ALLOCATE MEMORY */
    data = (double **)malloc(numObs * sizeof(double *));
    mm = (double **)malloc(numVars * sizeof(double *));
    for (n = 0; n < numObs; n++)
        data[n] = (double *)malloc(numVars * sizeof(double));
    for (i = 0; i < numVars; i++)
        mm[i] = (double *)malloc(numVars * sizeof(double));
    mn = (double *)malloc(numVars * sizeof(double));
    sd = (double *)malloc(numVars * sizeof(double));

    /* READ IN DATA */
    filen = fopen(fn, "r");
    if (filen == NULL)
        return 1;
    for (n = 0; n < numObs; n++)
        for (i = 0; i < numVars; i++)
            fscanf(filen, "%lf", &data[n][i]);
    fclose(filen);

    /* INITIALIZE ARRAYS */
    for (i = 0; i < numVars; i++) {
        mn[i] = 0;
        for (j = 0; j < numVars; j++) {
            mm[i][j] = 0;
        }
    }

    /* COMPUTATIONS: sums, then means and mean cross-products */
    for (n = 0; n < numObs; n++) {
        for (i = 0; i < numVars; i++) {
            mn[i] += data[n][i];
            for (j = 0; j < numVars; j++) {
                mm[i][j] += data[n][i] * data[n][j];
            }
        }
    }
    for (i = 0; i < numVars; i++) {
        mn[i] /= numObs;
        for (j = 0; j < numVars; j++) {
            mm[i][j] /= numObs;
        }
    }
    for (i = 0; i < numVars; i++)
        sd[i] = sqrt(mm[i][i] - mn[i] * mn[i]);

    /* PRINT RESULTS */
    printf("\n\nMeans\n");
    for (i = 0; i < numVars; i++)
        printf(" %f\n", mn[i]);
    printf("\nStandard Deviations\n");
    for (i = 0; i < numVars; i++)
        printf(" %f\n", sd[i]);

    return 0;
}

In C++, a class can be designed to take care of much of this
tedious work. Initialization and allocation of memory can be
handled in the construction of the object. I/O operators can
be overloaded to handle the reading in and writing out of
objects. Math operators can be overloaded to perform the
calculations. All of the loops can be eliminated and essential
information about the objects can be hidden away in the
object so the researcher doesn’t need to be concerned with
them once the objects have been declared. The solution of
the above problem in M++ with a class called doubleArray
(for array of double precision numbers) is found in the more
readable program below:

#include <darray.h>
#include <stdlib.h>
#include <stream.hpp>

void main(int argc, char *argv[])
{
    char *fn;
    fn = argv[1];

    // DECLARE AND READ IN DATA
    doubleArray data;
    data.readASCII(fn);

    // COMPUTATIONS
    // numObs and numVars stand for the row and column counts of data
    doubleArray mn, vc, sd;
    mn = mean(data, 0);
    vc = transpose(data).product(data)/numObs;
    vc -= transpose(mn)*mn;
    sd = sqrt(vc()(Index(0, numVars, numVars+1)));

    // PRINT RESULTS
    cout << "Means\n" << mn;
    cout << "Standard Deviations\n" << sd;
}

Declaration, allocation of memory, and initialization of the
arrays are accomplished in single statements replacing many
lines of C code. One statement handles the input of the
data, dimensioning the array automatically so that command
line arguments are no longer necessary. The many loops in
the C code are reduced to a few lines of code.
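
The mechanism behind this reduction can be sketched in standard C++. The class below is a hypothetical illustration and is not the M++ `doubleArray` class: the name `Vec` and the free functions `mean` and `stdDev` are assumptions made for this example. It shows how a constructor can take over allocation and initialization, and how overloaded operators can hide the element-by-element loops.

```cpp
#include <cmath>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical one-dimensional array of doubles (not the M++ class).
class Vec {
public:
    // The constructors allocate and initialize the storage.
    explicit Vec(std::size_t n, double fill = 0.0) : v_(n, fill) {}
    Vec(std::initializer_list<double> init) : v_(init) {}

    std::size_t size() const { return v_.size(); }
    double& operator[](std::size_t i) { return v_[i]; }
    double operator[](std::size_t i) const { return v_[i]; }

    // Subtract a scalar from every element.
    Vec operator-(double s) const {
        Vec out(size());
        for (std::size_t i = 0; i < size(); ++i) out[i] = v_[i] - s;
        return out;
    }
    // Elementwise product; both operands are assumed to have equal size.
    Vec operator*(const Vec& o) const {
        Vec out(size());
        for (std::size_t i = 0; i < size(); ++i) out[i] = v_[i] * o[i];
        return out;
    }

private:
    std::vector<double> v_;
};

// Arithmetic mean of the elements (the array is assumed non-empty).
double mean(const Vec& x) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += x[i];
    return sum / static_cast<double>(x.size());
}

// Standard deviation with divisor n, matching the C program above.
double stdDev(const Vec& x) {
    const Vec d = x - mean(x);
    return std::sqrt(mean(d * d));
}
```

With these definitions, `Vec x{2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};` followed by `stdDev(x)` replaces the explicit allocation and loop code; the loops still exist, but only inside the class.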
The critical feature of C++ is its ability to provide a
syntactical interface that fits the conceptual parts of the
problem. It is possible to design a set of functions in a
non-object-oriented language that performs the above task in just
about as few lines of code. There wouldn’t, however, be
any fit of these functions to the elements of the problem.
Each function would require a series of arguments that
would have to be documented and looked up before use
because there wouldn’t be any natural way to handle them.
The researcher’s task in such a function-based language involves
assembling and arranging arguments, and thus the problem
must be translated into a structural form dictated by the
syntax of the programming language.

C++, on the other hand, has a syntax that can be designed
to fit the problem. The researcher’s problem can be broken
down into parts in a natural way. If arrays are a
fundamental part of the problem then an array type can be
created and the program will now be developed with arrays
as a fundamental type.

3  M++,  AN  ARRAY  EXTENSION  TO  C++ 

For the researcher the array type is necessary. While C++
offers researchers an opportunity to design their own such type,
they may also turn to a well designed class library such as
M++. Three years have been devoted to the development of
the M++ class library that turns C++ into a powerful array
language with a fundamental array type having four
dimensions, but easily extendible to any number of
dimensions. It incorporates a full complement of
mathematical and statistical operators and functions,
including many based on EISPACK and LINPACK.

Using M++ as a base the researcher can go on to more
complex abstractions. Consider the problem described in the
previous section. The covariance matrix is computed in a
standard way. However, the symmetric matrix result entails
the calculation of n(n-1)/2 redundant elements. This
duplication of effort as well as certain problems in precision
could be avoided if the result could be computed from an
update to a Choleski factorization.

To  solve  this  problem,  first  we  create  a  Moment  class 
derived  from  the  M++  Choleski  class.  The  Moment 
constructor  would  take  a  data  set  as  an  argument  and 
compute  a  Choleski  factorization  via  the  update  method 
augmenting  the  data  set  with  a  column  of  ones.  It  would 
then  store  the  result  as  a  private  data  member  in  its  base 
class.  The  Moment  class  would  have  methods  for 
computing  moments,  means,  covariances,  and  so  on,  from 
the  factorization.  It  would  also  be  able  to  use  the  base  class 
methods  for  computing  inverses  and  determinants. 
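
The source does not show how the M++ Choleski class performs its update, so the following is only a sketch of one standard way to do it, under stated assumptions: each observation row, augmented with a leading one, is folded into an upper-triangular factor R by Givens rotations, so that R'R equals the moment matrix of the augmented data without that matrix ever being formed. The names `updateFactor`, `momentFactor` and `crossProduct` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Fold one row x into the upper-triangular factor R so that R'R gains x'x.
// x is passed by value because it is used as workspace.
void updateFactor(Matrix& R, std::vector<double> x) {
    const std::size_t p = R.size();
    for (std::size_t k = 0; k < p; ++k) {
        const double r = std::hypot(R[k][k], x[k]);
        if (r == 0.0) continue;            // nothing to rotate in this column
        const double c = R[k][k] / r;      // Givens rotation that zeroes x[k]
        const double s = x[k] / r;
        for (std::size_t j = k; j < p; ++j) {
            const double t = c * R[k][j] + s * x[j];
            x[j] = -s * R[k][j] + c * x[j];
            R[k][j] = t;
        }
    }
}

// Factor of the moment matrix of the data augmented with a column of ones.
Matrix momentFactor(const Matrix& data) {
    const std::size_t p = data.empty() ? 0 : data[0].size() + 1;
    Matrix R(p, std::vector<double>(p, 0.0));
    for (const auto& row : data) {
        std::vector<double> x(1, 1.0);     // the augmenting one
        x.insert(x.end(), row.begin(), row.end());
        updateFactor(R, x);
    }
    return R;
}

// Recover the moment matrix R'R, here only for checking the factor.
Matrix crossProduct(const Matrix& R) {
    const std::size_t p = R.size();
    Matrix M(p, std::vector<double>(p, 0.0));
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k)
                M[i][j] += R[k][i] * R[k][j];
    return M;
}
```

In the recovered moment matrix, element (0,0) is the number of observations and the rest of row 0 holds the column sums, so means follow by division; covariances come from the remaining block in the usual way.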

When actually using this class, analysts wouldn’t have to
concern themselves with how the data were stored in the
object. All they would know is that they have created a
Moment object that may be interrogated for various kinds
of information about the data. For example,

#include <darray.h>
#include <dmom.h>
#include <stdlib.h>
#include <stream.hpp>

void main(int argc, char *argv[2])
{
    char *fn;
    fn = argv[1];

    // DECLARE AND READ IN DATA
    doubleArray data;
    data.readASCII(fn);

    // DECLARE MOMENT OBJECT
    doubleMoment M(data);

    // PRINT RESULTS
    cout << "Means\n" << M.mean();
    cout << "Standard Deviations\n" << M.stdDev();
    cout << "Correlation Matrix\n" << M.corr();
}

We have now reduced our original problem to the simple act
of declaring an object. While we still haven’t done anything
that couldn’t be accomplished by a statistical package, C++
is just beginning whereas the statistical package stops there.
Now that we have a Moment object, we may treat it like
any other object. Operations can be defined to manipulate
it. For example, pooled moment matrices can be created by
adding Moment objects, or they can be updated with more
observations by adding the Moment object to an array
object containing a data set. Or arrays of Moment objects
can be created and manipulated in higher order array
operations. The statistical package is at the end of a creative
effort, but C++ is only the beginning of the creative
mathematical and statistical imagination, and M++ is a step
on the way.
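
Pooling by addition can be sketched as follows. This is a hypothetical, single-variable stand-in and not the M++ `doubleMoment` class: instead of a Choleski factor it stores the count, the mean, and the sum of squared deviations, and it pools with the standard pairwise-combination formula. All names are assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical running-moment object for one variable.
struct Moment {
    std::size_t n = 0;
    double mean = 0.0;
    double m2 = 0.0;   // sum of squared deviations from the mean

    Moment() = default;
    explicit Moment(const std::vector<double>& data) {
        for (double x : data) *this += x;
    }

    // Update with one more observation (Welford's recurrence).
    Moment& operator+=(double x) {
        ++n;
        const double d = x - mean;
        mean += d / static_cast<double>(n);
        m2 += d * (x - mean);
        return *this;
    }

    // Variance with divisor n, as in the earlier listings.
    double variance() const { return n ? m2 / static_cast<double>(n) : 0.0; }
    double stdDev() const { return std::sqrt(variance()); }
};

// Pool two Moment objects as if their data sets had been concatenated.
Moment operator+(const Moment& a, const Moment& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    Moment out;
    out.n = a.n + b.n;
    const double na = static_cast<double>(a.n);
    const double nb = static_cast<double>(b.n);
    const double d = b.mean - a.mean;
    out.mean = a.mean + d * nb / (na + nb);
    out.m2 = a.m2 + b.m2 + d * d * na * nb / (na + nb);
    return out;
}
```

Given two objects `a` and `b` built from separate data sets, `(a + b).stdDev()` gives the pooled standard deviation without revisiting the data.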

4  Conclusion 

C++ has opened the way to object-oriented program design
for numerical analysis. M++ is the foundation for the
application of C++ to numerical problems by providing the
array classes for handling memory allocation and
initialization as well as the operators and functions for
manipulating them. As is, M++ has the functionality of a
complete array language, but it needn’t stop there. New
objects can be derived from the classes available in M++
that fit the problem the analyst has in mind. What is
needed? It might be arrays of rational numbers - certain
kinds of problems can be posed entirely in rational numbers.
Or, perhaps, interval arithmetic is required, in which an upper
and lower bound are stored rather than a single number. A
derived interval class with a set of overloaded operators and
functions would allow an analyst to develop large, complex
programs with a syntax that relieved them of having to think
about the intervals.
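
As a minimal sketch of the interval idea, assuming nothing about M++ itself: the `Interval` type below is a standalone struct rather than a class derived from an M++ array, and it omits the outward rounding that a rigorous interval package needs. It only shows how overloaded operators let ordinary-looking formulas carry bounds along.

```cpp
#include <algorithm>

// Hypothetical closed interval [lo, hi]; rounding control is omitted.
struct Interval {
    double lo;
    double hi;
};

Interval operator+(Interval a, Interval b) {
    return Interval{a.lo + b.lo, a.hi + b.hi};
}

Interval operator-(Interval a, Interval b) {
    return Interval{a.lo - b.hi, a.hi - b.lo};
}

Interval operator*(Interval a, Interval b) {
    // The extremes of a product lie among the four endpoint products.
    const double p[4] = {a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi};
    return Interval{*std::min_element(p, p + 4), *std::max_element(p, p + 4)};
}

// A formula written once; it works for double and for Interval alike.
template <typename T>
T poly(T x) {
    return x * x - x;
}
```

Calling `poly(3.0)` evaluates the formula on a number, while `poly(Interval{-1.0, 2.0})` returns bounds that enclose every value of the formula on that interval; the bounds are valid but not necessarily tight, because each occurrence of `x` is treated independently.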

Whatever  the  analytical  problem,  the  researcher  will  find 
in  C++  and  M++  a  powerful  tool  for  solving  it. 
Introduction

This paper discusses programming principles and
practices required to construct a crosstabulation
program, a subset of a statistics package substantial
enough to illustrate these principles without
involving numerical analysis problems. This is a
preface to our construction of a simple but powerful
tabulation program via extracts from the code in our
own packages (e.g. StatZ [1]).
Objectives  and  Limitations 

For  the  present  purposes,  we  envisage  users  who 
are  moderately  experienced  researchers,  typically 
wanting  many  tables  of  modest  complexity  from 
potentially  large  data  -  for  instance,  researchers 
with  surveys  of  300  readings  on  each  of  10000 
subjects.  This  is  not  really  ignoring  the  masses 
with  smaller  jobs  and  less  expertise;  efficiency 
considerations  will  also  usually  be  similar. 

The major initial technical decision to be made in
the face of limitations concerns data storage. In a
sense the decision is finally simple - data cannot be
held in memory as it is prone to be too large. Thus
it must be treated in segments. 'Paged' memory is
an attractive option, but as not all systems are
particularly well set up for it, program developers
would probably have to implement it themselves for
their package input, and it would have system
dependencies. Thus the extra programming implied
by taking the data in on a per case basis and fully
utilising that line to contribute to all the relevant
tables, before replacing it with the next line and
repeating the table construction calculations, is
easily justified, and is in fact the method used by
most of the older statistical packages too. Thus our
initial program emphasis here is on formal
structures in the language Pascal (Wirth, [5]) for
the tables description, which dictate the table
computations to be done as each data line is read.

We have used Pascal as the programming language
to show this work on the grounds that it is the best
known language of serious programmers, and more
importantly it is a fully structured language.
Further discussion of Pascal's general virtues is
given in the references [2,3,4].

Data  Structure 

We read only one case at a time from the data file,
so the input structure is simply an array. The tables
formed will also be in a single array. Respectively,
we define types and variables:

type
  DataArray = array[1..maxVars] of real;
  TableArray = array[1..maxCells] of real;

var DataCase : DataArray; Tables : TableArray;

It is of considerable relevance how the second
array is actually used; for the moment note again
that at any point in the computation all the
tables within the table array are updated using only
the data line just read. We assume that data
elements are real numbers; more general situations
of mixed reals, integers & text, transformations,
recodings and selection criteria are merely later
refinements.

Essential  Structures  1: 

Record  and  Set  Types  in  Table  Definitions 

We consider a table definition as having three
components -

1) a classifying factors array defining
the variables and their order within the table

2) a set defining the statistics to be placed
in the cell defined by the indices of 1)

3) the object variable, if any, to which
the statistical computation is applied

The first component is illustrated by the description
AgeGroup * Sex * Country, say,
meaning that we are defining a three way table of 24
cells, where there are, say

4 AgeGroups [<18, 18-37, 38-62, >62 years]

2 Sexes [male, female] and

3 Countries [Australia, US, UK].
If we want simply a table of counts of our cases
according to this cross-classification, the statistic
required as the second component is 'count', and
this is the default computation. However the
structure optionally allows us to request means of a
(continuous) variable, say Salary, by selecting the
statistic to be 'mean' and the component 3 'object
variable' as Salary. Component 3 is not relevant
unless component 2 is other than 'count'.

The use of record type constructs, a variable with
many component sub-variables, is well entrenched
in the structured languages, and we use them
heavily throughout our code. It is appropriate here
to define a table specification by a record type:
type TableSpec =
  record
    ClassSet : array[1..maxFact] of integer;
    TabSize : integer;
    TabBase : integer;
    StatsWanted : StatSet;
    ObjectVar : integer;
  end;

where TabBase is the address in the table array of
the first cell of the constructed table, and the set
type
  StatSet = set of (count, mean, stdev, min, max);
is the set of possible statistics to be calculated for
each object variable given.

This formulation makes it possible to specify many
tables to be obtained in a single pass through the
data, each specification being stored as an element
of the vector of TableSpec records
var
  TabsInPass : array[1..maxTabs] of TableSpec;


Each table's memory allocation is visualised as a
linear array, and the collection of table arrays is
stored contiguously to form the single linear array
of all the table cells as declared above. The i'th
table then has an origin at location number
TabsInPass[i].TabBase within this large array. We
detail later a further payoff of this structure: multiple
response ('group') variables are efficiently
handled within the same structural processing.

At runtime, for each data line we simply cycle
through all the elements of the TabsInPass array of
table definitions. The target cell within the i'th
table has an offset from TabBase computed from
the values of the classification variables found in
the data line. The statistic wanted determines what
is done in the target cell - if a count is required the
cell is merely incremented by unity, while for a
mean of Salary, the cell is incremented by the
case's value of Salary; or a substitution may be
needed if maximum/minimum is required.
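
The per-case update can be sketched as below. The sketch is in C++ rather than the paper's Pascal, and it rests on assumptions the text does not state: factor levels are coded 1..k in the data case, the first factor varies slowest within a table, and the number of levels of each factor is stored in the (hypothetical) `levels` field. Only the 'count' statistic and the running sum needed for a mean are shown.

```cpp
#include <cstddef>
#include <vector>

enum class Stat { Count, Sum };

// Hypothetical C++ counterpart of the TableSpec record.
struct TableSpec {
    std::vector<int> classVars;  // positions of the factors in the data case
    std::vector<int> levels;     // number of levels of each factor (assumed)
    int tabBase;                 // first cell of this table in the big array
    Stat stat;                   // statistic accumulated in each cell
    int objectVar;               // position of the object variable, if any
};

// Address of the target cell for one data case.
int cellIndex(const TableSpec& t, const std::vector<double>& dataCase) {
    int offset = 0;
    for (std::size_t f = 0; f < t.classVars.size(); ++f) {
        const int level = static_cast<int>(dataCase[t.classVars[f]]) - 1;
        offset = offset * t.levels[f] + level;
    }
    return t.tabBase + offset;
}

// Update every table in the pass using only the case just read.
void updateTables(const std::vector<TableSpec>& tabsInPass,
                  const std::vector<double>& dataCase,
                  std::vector<double>& tables) {
    for (const TableSpec& t : tabsInPass) {
        const int cell = cellIndex(t, dataCase);
        if (t.stat == Stat::Count)
            tables[cell] += 1.0;                    // count: add unity
        else
            tables[cell] += dataCase[t.objectVar];  // sum, divided later for a mean
    }
}
```

For the 4 x 2 x 3 example, a case with AgeGroup 2, Sex 1 and Country 3 lands at offset ((1 x 2) + 0) x 3 + 2 = 8 from the table's base.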

As  computing  efficiency  matters  do  not  impact  the 
structural  considerations,  we  pass  over  them  here. 

Essential  Structures  2: 

Group  Variables 

Most  surveys  use  group  variables,  often  more  by 
accident  than  design.  These  are  of  two  structurally 
different,  though  related,  forms. 

The first is illustrated by the following. Suppose
an opinion pollster conducts a survey on
newspaper readership, in which persons are asked
to name up to three papers they read regularly. The
survey form would have three slots to fill, and
these would usually become three variables, say
Paper1, Paper2 and Paper3. However these three
variables are equivalent in the sense that each has
the same range of possibilities - one respondent
could name the Daily Bugle in Paper2 while
another respondent does not have it at all and a
third names it in Paper1. The pollster usually
wants as one of the results, a table of the form
Paper by AgeGp by Sex,
in effect combining the three simple tables
Paper1 by AgeGp by Sex

Paper2 by AgeGp by Sex

Paper3 by AgeGp by Sex.

We solve this by effectively processing each case
three times, using Paper1 the first time, then Paper2
(with the same Age and Sex values), then Paper3. In
the present structure this is easy to organise - three
tables are formed but the trick is that each is given
the same base address. Hence the cumulative table
is formed without physically getting the separate
constituent tables and adding them. This means we
do not have to reserve space for three tables, nor
waste time with exceptions-testing code in the
innermost loops. All exceptions processing work
then occurs in the time non-critical phases of table
definition and output configuration.
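
The shared-base trick can be sketched in a few lines of C++ (not the paper's Pascal). The coding is an assumption made for the example: paper codes start at 0, with 0 meaning "no paper named in this slot", and sex is coded 0 or 1, so that every slot maps to a valid cell and the inner loop needs no exception test. The struct and function names are hypothetical.

```cpp
#include <vector>

// Hypothetical two-way table definition: row factor by column factor.
struct Spec {
    int rowVar;  // position of the row factor (a Paper slot) in the case
    int colVar;  // position of the column factor (Sex) in the case
    int nCols;   // number of levels of the column factor
    int base;    // base address of the table in the cell array
};

// Update all tables for one case; tables sharing a base accumulate
// into the same cells, which forms the combined table directly.
void updateCells(const std::vector<Spec>& specs,
                 const std::vector<int>& dataCase,
                 std::vector<int>& cells) {
    for (const Spec& s : specs) {
        const int cell = s.base + dataCase[s.rowVar] * s.nCols + dataCase[s.colVar];
        cells[cell] += 1;
    }
}
```

Defining three `Spec` entries for Paper1, Paper2 and Paper3 with the same `base` yields the single Paper by Sex table, using the space of one table only.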

The second group variable situation is an extension
of the above (conversely, the first is a 'cheap' form
of this one). In this, the survey carries a (usually
large) list of newspapers to be considered, and the
respondents tick all that are relevant. So we have Paper1,
... Paper25 say, and any number can be ticked by a
single respondent. Here we consider the group
variable Paper as a 'super variable' having Paper1,
... Paper25 as its 25 'levels'. Again, processing
cases is transparent if we define twenty-five tables
as above, now with appropriately modified base
addresses, and leave the fact that there is only one
real result table, a concatenation of the 25, to be
easily sorted out at print time (this involves some
structural detail for aliased and non-printed tables).
The wastage of table definition space is offset by
the economy gained by the fact that target index
computations for the 25 tables to be updated for a
single case have all but one source index identical.

The record structure describing both types of group
variables uses a set type variable for group, and is:
type groupvars =
  record
    GrpName : string[15];
    GrpType : (e,s);
    GrpLo, GrpHi : integer;
    Grpindxs : array[1..maxInGrp] of integer;
  end;

Essential  Structures  3: 

The  Driving  Program  and  Overlays 

Having  considered  specific  calculation  modules, 
we  turn  to  the  main-line  or  driver  program,  and  the 
structure  it  requires. 

At one level, the driver is a very simple looping
procedure which presents a prompt to the user, and
reads in the user's next command. It determines
what the command is, and proceeds to call the
procedure/module enacting that command. At a
second level, it houses global variables, and all
utility procedures needed permanently in memory
for more than one module. In particular, it holds
the case read/select run time procedures, for these
will also be needed if in the future we add new
statistical modules which must also read the data.

We find this loop structure very useful as it can be
extended as the program grows by simply adding
new control keys and corresponding procedure
calls. It can also be replicated within statistical
procedures themselves to handle sub-commands.

On the other hand, modules which are needed in
the short term only can be put outside the mainline,
and this determines another level of structure called
overlays or segments, in which independent
procedures alternately use the same memory area
when they are being executed. The power
of this structure is that, on most sophisticated
compilers on PCs, it allows sets of procedures
to be constructed and compiled as independent
units and stored externally to the main program. In
execution, the mainline procedures remain in
memory at all times, loading an external module
into memory only when it is required. The external
modules use the same physical memory area, as
each overwrites the space used by the earlier one.
The program skeleton is

Program StatsCrossTables;

{Driver loop and global definitions}
const ... type ... var ...

external procedure Tables(var ok : boolean);
external procedure Transforms(var ok : boolean);

{$I SupportProcs}       {for i/o & general needs}
{$I SundryStatsProcs}   {common simple stats}

var  {Local definitions}
  endit,                {Quit flag}
  ok : boolean;         {operation successful}
  wd : string[15];      {command keyword typed by the user}

begin {driver}
  endit := false;
  while not endit do
  begin
    write(outfile, '==>');
    readln(wd);  {user gives 'command' keyword}
    case left(wd,3) of
      'qui': endit := true;
      'fil': openToRead(FileName, ok);
      'tab': Tables(ok);
      'tra': Transforms(ok);
      'lev': defineLevnames(ok);
      'var': defineVarnames(ok);
    end {case}
  end; {while}
  writeln('Goodbye.');
end. {driver}

Such overlaying is essential in constructing large
systems of any kind, though text books on the
subject are in fact rare. Note that external modules
are not strictly standard in Pascal, but all modern
compilers like Apple's MPW do support them (and
hence our recent move to Modula2 systems, where
they are defined as standard). A full discussion of
these matters is given in [4].

Essential  Structures  3: 

Sets  &  Pointers 


We  have  referred  briefly  to  sets  in  earlier  sections. 
This  program  does  not  require  any  more  complex 
usage  of  sets  than  that  used  in  the  StatsWanted 
item  of  the  TableSpec  record,  but  more  complex 
set  constructs  are  useful  elsewhere,  as  shown  in  [2] 
for  ANOVA  and  Log-Linear  models. 


We find pointers very necessary where we must
allocate temporary space for large arrays, usually of
a dimension which cannot be determined until run
time. Space is allocated and deallocated in ('spare')
heap memory, allowing great flexibility in larger
programs. Relevant here is the LevelsNames
array, an array of text for the levels of a 'factor' or
classification variable. The structure is used within
the variables information record:
type VariablesRec =
  record
    VarName : string[15];
    mean : real;           {if/when available}
    stddev : real;         {if/when available}
    min : real;            {if/when available}
    max : real;            {if/when available}
    NoOfLevels : integer;  {0=not factor}
    LevelsArrayPtr : ^LevelsArray;
  end;

and we define the array

  VarInfo : array[1..maxVars] of VariablesRec
to contain useful information about each variable in
the data file. The first six items are obvious, and
the last points to the

  LevelsArray = array[1..classes] of string[15];
which is allocated at run time when the size classes is
known. This structure also leads to efficiencies of
storage, as many variables may be allocated the
same names (e.g. 'Yes', 'No').
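
The storage saving from shared level names can be sketched in C++ (the paper uses Pascal pointers). Everything here is a hypothetical analogue: `std::shared_ptr` stands in for the `^LevelsArray` pointer, and the helper `makeFactor` is an assumption introduced for the example.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

using LevelsArray = std::vector<std::string>;

// Hypothetical counterpart of the variables information record.
struct VariablesRec {
    std::string varName;
    int noOfLevels = 0;                   // 0 = not a factor
    std::shared_ptr<LevelsArray> levels;  // null for non-factors
};

// Attach an existing, run-time allocated list of level names to a
// variable; the names themselves are not copied.
VariablesRec makeFactor(const std::string& name,
                        std::shared_ptr<LevelsArray> names) {
    VariablesRec v;
    v.varName = name;
    v.noOfLevels = static_cast<int>(names->size());
    v.levels = std::move(names);
    return v;
}
```

Two yes/no variables built from the same `LevelsArray` then point at one stored copy of the names 'Yes' and 'No'.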


Essential  Structures  4: 
Recursive  Procedures 


A discussion of recursive procedures must be passed over
here for lack of space. Particularly useful ones
in crosstabulation are fully discussed in [2,3].

Essential  Structures  5: 

Graphics  User  Interface 

Besides space problems, there are interlocking
reasons for omitting this discussion also - we find
that users eventually move to batch methods and so
a 'command line' interface, as here, is always also
needed. Further, it is relatively easy to put up a
graphics interface preprocessor to construct the
commands at a later time, in accord with the
programming Principle of Successive Refinement.

Summary 

We indicate in this note that structures commonly
used in developing large commercial packages by
professional programmers are equally relevant to
statisticians' programming. We also hope that our
programs StatZ etc. using these will prove to be
more than a cerebral exercise.
ABSTRACT

This paper describes how a flexible object-oriented
programming environment can greatly enhance possibilities
for interactive model building and for the choice,
analysis, and estimation of models. In our case we want
to model $y = f(x)$ with the non-orthodox requirement
that $f$ have as few inflection points as necessary but that $f$
be otherwise unconstrained (nonparametric). One
possibility is to express $f''$ as
$f''(x) = \pm (x - w_1) \cdots (x - w_j) \exp h_f(x)$,
and to model $h_f$ as a linear spline. There are
many ways of dealing with a penalty to be added to
the log-likelihood, and of estimating the different groups
of parameters. Using the object-oriented CLOS-based
Arizona environment, these different possibilities are
quite easy to investigate.

1 Introduction: Penalizing the Proper Roughness

In order to think about scatter plot smoothing, let us
reconsider how we measure smoothness. For convenience,
we are going to measure 'roughness' $R[f]$ of a function $f$
as opposed to smoothness. Our approach to estimating
$f$ in the model $y_i = f(x_i) + \varepsilon_i$, $i = 1, \dots, n$
($\varepsilon_i$ are i.i.d. with $E[\varepsilon_i] = 0$), is a
Maximum Penalized Likelihood (MPL) criterion as follows:

$$\min_f \; \sum_{i=1}^{n} \rho\bigl(y_i - f(x_i)\bigr) + \lambda R[f]. \qquad (1)$$

The first term in (1) is the negative log-likelihood when
the $\varepsilon_i$ are i.i.d. with density $g(t) \propto e^{-\rho(t)}$.
For Gaussian errors it is the usual residual sum of squares.
Allowing for a more general $\rho$, e.g., using "Huber's rho",
$\rho_c(x) = \bigl(x^2 - (|x| - c)_+^2\bigr)/2$, we make sure that our
estimate of $f$ will be robust against outlying observations
$y_i$; see Machler (1989), and Cox (1983). The smoothing
parameter $\lambda$ determines a balance between fidelity
to the data (small residuals) and smoothness of $f$.
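As a small illustration of the robust loss, the sketch below evaluates Huber's rho both through the closed form with the positive part $(|x| - c)_+$ and through the familiar piecewise definition, and assembles the penalized criterion for given fitted values. The function names, and the assumption that the roughness $R[f]$ is supplied as a precomputed number, are illustrative choices and not part of the original paper.

```python
def huber_rho(x: float, c: float) -> float:
    """Huber's rho via the closed form (x^2 - (|x| - c)_+^2) / 2."""
    excess = max(abs(x) - c, 0.0)  # positive part (|x| - c)_+
    return (x * x - excess * excess) / 2.0


def huber_rho_piecewise(x: float, c: float) -> float:
    """Equivalent piecewise form: quadratic inside [-c, c], linear outside."""
    if abs(x) <= c:
        return x * x / 2.0
    return c * abs(x) - c * c / 2.0


def mpl_criterion(y, fitted, roughness, lam, c):
    """Sum of rho(residuals) plus lam times a precomputed roughness value."""
    loss = sum(huber_rho(yi - fi, c) for yi, fi in zip(y, fitted))
    return loss + lam * roughness
```

For a residual of 3 with $c = 1$, both forms give $2.5$, whereas the squared-error loss would give $4.5$; large residuals are thus penalized only linearly.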

The issue mentioned above can now be stated as

What roughness penalties $R[f]$ are appropriate?

In nonparametric density estimation, the MPL approach
has been investigated in much detail; see Good and
Gaskins (1971), Tapia and Thompson (1978), and
Silverman (1986), section 5.2, where some of the earlier
work is discussed. Although Good and Gaskins (1971)
do discuss the choice of penalty, it is usually selected in
such a way that the subsequent problem is "tractable".

MPL has a more recent history in the context of
regression, but considerable investigation has been done,
including generalizations for dealing with generalized
linear models; see, e.g., O'Sullivan et al. (1986), Gu
(1989) and Wahba (1990), ch. 9. Here, the roughness
penalty is always assumed to be the squared seminorm
of a (linear) projector in some (function) Hilbert
space. The methodology of reproducing kernels leads
then to simple characterizations of the solution,
sometimes called generalized splines.

In contrast, we want to choose a roughness penalty
according to a more qualitative notion of smoothness.
The penalty which leads to polynomial spline functions,
$R[f] = \int \bigl(f^{(m)}(t)\bigr)^2\,dt$ ($m = 2$ gives cubic splines), was
originally motivated by the fact that $f''(x)$ is in some
way "proportional" to the curvature of $f$ at $x$. Ideally,
$R[f]$ would measure an average (squared) local curvature.
The domain of integration, here and subsequently,
is the full range of $\{x_i\}$. As seen in Machler (1989)
and below, this is not true in general. We have developed
a new roughness penalty trying to incorporate a
global notion of smoothness. Approximately measuring
change of curvature instead of curvature leads "naturally"
to consideration of inflection points, i.e., points
where the curve goes from convex to concave or vice
versa. The final approach, denoted by "Wp", can be
considered a parametric-nonparametric hybrid: Assuming
a given number of inflection points $J$, we consider
the MPL problem where the roughness penalty now measures
the "remaining change of curvature", given $J$
inflection points; $J$ can be varied on top of this as well.

An example with real data is given in figure 1.
This data set hs.data is available in S by the statements
s.hs <- sabl(hstart); hs.data <- s.hs$trend + s.hs$irregular.
Cubic splines, "lowess" (Cleveland (1979)) and our "Wp"
smoother (with $J = 3$) were tuned to have identical
average squared residual. Note that the two classical
smoothers suffer somewhat from "erosion" at the two
local minima and both have extra inflection points.

Figure 1: Housing Starts in USA (n = 108): smoothers
with the same RSS. Example of (deseasonalized) housing
starts (in the U.S.) for 108 months. Three different
smoothers, tuned to have identical residual sum of
squares. "Wp" smoother with $J = 3$, i.e., 3 inflection
points.

Perhaps the most closely related work to ours is work
on "isotonic" and "constrained" splines. Wright and
Wegman (1980), e.g., prove the existence of splines under
restrictions such as (piecewise) monotonicity, convexity,
etc. A corresponding algorithm has been provided
by Irvine et al. (1986) and put into the IMSL
library subsequently (as 'CSCON'). Examples of other
approaches using regression splines with a few (possibly)
variable knots were taken by, e.g., Dierckx (1980), and
Ramsay (1988).

These "smooth" curves, however, often are 'nearly'
piecewise linear with 'brisk' changes of slope, which is
very unsatisfactory. This behavior is well explained
intuitively: One is looking for a function $f$ minimizing
$\int f''^2(t)\,dt$ under restrictions such as $(-1)^j f''(x) \ge 0$ on some
sub-intervals $I_j \subset [a, b]$. To solve this problem
approximately one might just look for $f$ in each $I_j$ separately
requiring, e.g., $f'' \ge 0$, and $f'' = 0$ at the interval
boundaries. The minimization of $\int_{I_j} f''^2(t)\,dt$ now will often
yield whole subintervals where $f'' \approx 0$, i.e., $f$ is locally
almost linear.


2 "Wp": Change of Curvature Roughness, Inflection Points


The traditional smoothing splines approach is motivated
by the idea of penalizing high curvature $\kappa$ with the
penalty $R[f] = \int \kappa(t)^2\,dt$. Because the curvature is given
by $\kappa(x) = f''(x)\,\bigl(1 + f'(x)^2\bigr)^{-3/2}$, it may be
approximated by $\kappa(x) \approx c \cdot f''(x)$ (if $f'(x) \approx$ const!) which
leads to cubic splines. This approximation to $\kappa$ need not
be good in some cases (see Machler (1989)) and is used
mainly because of the simplicity of $f''^2$ (and its
corresponding solution to the variational problem) compared
to $\kappa^2$. But the more important issue is

Does  high  curvature  really  mean  “roughness”  ? 

Machler (1989) argues that it may be more 'natural' to
take the "standardized change of curvature"

$$\frac{\kappa'}{\kappa} = \frac{f'''}{f''} - \frac{3 f' f''}{1 + f'^2}$$

as measure of roughness. The approximation $f'''/f'' \approx \kappa'/\kappa$
holds exactly at all the local extrema and inflection
points (the most interesting points of the curve) and is
qualitatively better than $f''(x) \approx \kappa(x)$ in many
situations.
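The statement that $f'''/f''$ agrees with $\kappa'/\kappa$ at local extrema can be checked numerically. The sketch below uses the cubic $f(x) = x^3 - 3x$ (an arbitrary test function chosen here, not taken from the paper), differentiates the curvature by central differences, and compares the result with the analytic expression obtained by differentiating $\log \kappa$.

```python
def f1(x):
    """f'(x) for the test function f(x) = x^3 - 3x."""
    return 3.0 * x * x - 3.0


def f2(x):
    """f''(x)."""
    return 6.0 * x


def f3(x):
    """f'''(x)."""
    return 6.0


def kappa(x):
    """Curvature f'' / (1 + f'^2)^(3/2)."""
    return f2(x) / (1.0 + f1(x) ** 2) ** 1.5


def relative_change_numeric(x, h=1e-5):
    """kappa'/kappa, with kappa' from a central difference."""
    return (kappa(x + h) - kappa(x - h)) / (2.0 * h) / kappa(x)


def relative_change_formula(x):
    """kappa'/kappa = f'''/f'' - 3 f' f'' / (1 + f'^2)."""
    return f3(x) / f2(x) - 3.0 * f1(x) * f2(x) / (1.0 + f1(x) ** 2)


# At the local minimum x = 1 (f' = 0) the correction term vanishes,
# so f'''/f'' = 1 matches kappa'/kappa; at x = 0.5 the two differ.
at_extremum = (f3(1.0) / f2(1.0), relative_change_numeric(1.0))
away = (f3(0.5) / f2(0.5), relative_change_numeric(0.5))
```

Away from extrema and inflection points the correction term $3f'f''/(1 + f'^2)$ can be substantial, which is where the approximation is only qualitative.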

This approximation for the relative change of curvature
leads to the preliminary penalty
$R[f] := \int \bigl(f'''(t)/f''(t)\bigr)^2\,dt$. Now, let us assume that $f$ has $J$
inflection points, say $w_1, \dots, w_J$. Equivalently, $f''$ has
exactly the zeros $w_j$, $j = 1, \dots, J$. Then $f'''/f''$ has first
order poles at these locations and $R[f]$ "contains $J$ times
$\infty$". But we can "rescale" the problem by expressing the
second derivative as

$$f''(x) = \underbrace{s_f\,(x - w_1)(x - w_2) \cdots (x - w_J)}_{=\,P_w(x)} \; e^{h_f(x)}, \qquad s_f = \pm 1. \qquad (2)$$

Here, $e^{h_f(x)}$ represents any function with no zeros
($h_f(x) \in \mathbb{R}$), and $P_w(x)$ is a polynomial of degree $J$. We
can now express $h_f$ as $h_f(x) = \log\bigl(f''(x)/P_w(x)\bigr) =
\log|f''(x)/s_f| - \sum_{j=1}^{J} \log|x - w_j|$. This gives

$$h_f'(x) = \frac{f'''(x)}{f''(x)} - \sum_{j=1}^{J} \frac{1}{x - w_j}. \qquad (3)$$

Expression (3) is $f'''/f''$ (as in $R[f]$) minus all the poles.
If the inflection points $w_1, \dots, w_J$ are unknown
parameters, the choice of the penalty

$$R[f] := \int h_f'(t)^2\,dt, \qquad (4)$$

still penalizes the change of curvature and prevents more
than $J$ inflection points. The "order" $J$ (the number of
inflection points) is assumed to be given. For each $J$ we
have a class of functions with a fixed number of inflection
points.

In Machler (1989), the resulting variational problem
was considered and a (quite involved) numerical
algorithm for its solution was devised. Here, we solve the
problem in a restricted (but still rich) function space to
make it more feasible for inference. Namely, we model
$h_f$ as a polygon, or linear spline.

3 "Wp" parametrized; $h_f$ as linear spline

Assume the $x_i$ are ordered and split the data interval
into $M$ subintervals $I_k = [t_k, t_{k+1}[$, by knots
$x_1 = t_0 < t_1 < \dots < t_M = x_n$. We now parametrize $h_f$ as a
(general) linear spline with knots $\{t_k\}$. In each knot
interval, we set

$$h_f(x) = h_k + b_k (x - t_k), \quad \text{for } x \in I_k. \qquad (5)$$

We want $h_f$ to be continuous at all the inner knots, i.e.,
$h_k(t_{k+1})$ must equal $h_{k+1}(t_{k+1})$, or equivalently,

$$h_{k+1} = h_k + (t_{k+1} - t_k)\, b_k = h_0 + \sum_{j=0}^{k} (t_{j+1} - t_j)\, b_j \qquad (6)$$

for $k = 0, \dots, M - 1$. Therefore, $h_f$ is completely
specified by the given knots $\{t_k\}$ and the parameters
$(h_0, b_0, \dots, b_{M-1})$. Even $f(x)$ itself can be seen as a
parametric function (though semi-parametric in nature),
and, because we restricted $h_f$ appropriately, we can
explicitly express $f(x)$ (piecewise as linear plus the product
of polynomial and exponential). Because of the double
integration from $f''$ to $f$, there are two integration
constants $f_0$ and $f'_0$ which are new free parameters for
$f$. The penalty, $\int h_f'^2(t)\,dt$, is trivial to compute, since
$h_f'$ is piecewise constant. Our whole MPL criterion (1)
is now the minimization of a function of the parameters
$(w_1, \dots, w_J)$, $(b_0, \dots, b_{M-1})$, $h_0$, $f_0$ and $f'_0$, where
the last three can easily be determined as (linear or
robust, depending on the choice of $\rho$) regression coefficients
(given the other parameters), and where we assume that
the "curvature sign" $s_f$ and the knots $\{t_k\}$ are
given. Given data, we have the (nonlinear) minimization
problem of determining the $\{w_j\}$'s and $\{b_k\}$'s. Also,
some investigation about the choice of the knots $\{t_k\}$
(number and location) has to be done and we may want
to include this choice into the minimization problem.
In the remaining sections, we will see that an
object-oriented interactive graphical system such as Arizona
is ideal for investigating this minimization problem.
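The parametrization of $h_f$ can be made concrete with a short sketch. It builds the knot values $h_k$ from $h_0$ and the slopes $b_k$ through the continuity recursion, evaluates the linear spline, and computes the penalty, which reduces to $\sum_k b_k^2 (t_{k+1} - t_k)$ because $h_f'$ equals $b_k$ on each knot interval. Function names and the example numbers are illustrative assumptions, not taken from the Arizona/Cactus code.

```python
from bisect import bisect_right


def knot_values(h0, b, t):
    """Values h_k at the knots, from h_{k+1} = h_k + (t_{k+1} - t_k) * b_k."""
    h = [h0]
    for k, slope in enumerate(b):
        h.append(h[-1] + (t[k + 1] - t[k]) * slope)
    return h


def h_f(x, h0, b, t):
    """Evaluate the continuous linear spline at x (t has len(b) + 1 knots)."""
    h = knot_values(h0, b, t)
    # Index of the knot interval containing x; the last interval is closed.
    k = max(0, min(bisect_right(t, x) - 1, len(b) - 1))
    return h[k] + b[k] * (x - t[k])


def roughness(b, t):
    """Integral of h_f'(t)^2: the slope is constant on each knot interval."""
    return sum(slope * slope * (t[k + 1] - t[k]) for k, slope in enumerate(b))


# Hypothetical example: M = 2 intervals with knots 0 < 1 < 3.
knots = [0.0, 1.0, 3.0]
slopes = [2.0, -1.0]
values = knot_values(0.5, slopes, knots)
penalty = roughness(slopes, knots)
```

With these numbers the knot values are 0.5, 2.5 and 0.5, and the penalty is $2^2 \cdot 1 + (-1)^2 \cdot 2 = 6$.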

4  Cactus  in  Arizona 

Arizona is a software system under development at the
University of Washington, Seattle, by John McDonald
and coworkers (McDonald (1988)), based on Common
Lisp ('defined' in Steele (1990)) and CLOS, the Common
Lisp Object System (Keene (1988)). Citing McDonald
and Sannella (1991),

"Arizona is intended to be a portable,
public-domain collection of tools supporting
scientific computing, quantitative graphics,
and data analysis."

The above report and release of Arizona are available
via anonymous FTP from belgica.stat.washington.edu
in the directory pub/az. More documentation can be
found in the file collection doc.tar.Z. Release
0.0 (as of Feb. 91) contains four modules, Tools,
Geometry, Slate and Chart, in hierarchically increasing
order. Slate (relying on Geometry and Tools) provides
an object oriented user interface to bitmap graphics
and event driven user input, i.e., a Lisp toolkit for
X11, using CLX. Chart provides output-only high level
plot functionality.

Cactus  is  a  (not  yet  released)  module  for  numerical 
and  abstract  (!)  linear  algebra  and  optimization  par¬ 
tially  described  in  McDonald  (1989).  It  uses  all  mod¬ 
ules  above.  The  object  oriented  approach  enables  flex¬ 
ible  experimentation  with  different  optimization  meth¬ 
ods.  Both  optimizers  and  objective  functions  are  objects 
that  can  be  changed  or  replaced.  At  the  same  time, 
the  modularity  of  the  graphics  modules  allows  dynamic 
graphical  monitoring  of  the  minimization  process. 

5  “Wp”  with  Cactus 

The interface with Cactus requires an objective
function, i.e., a Lisp function mapping a vector of unknowns
(over which to minimize) into a real number. In our
case, we have the two parameter vectors $(w_1, \dots, w_J)$
and $(b_0, \dots, b_{M-1})$. We now have (at least) two
possibilities. First, we may concatenate the two vectors into one
vector of unknowns, and pass this to the minimizer(s).
Or, we can use two objective functions, one for each of
our vectors, with the other held constant, and alternate
minimizing the MPL over one vector at a time. The
alternating approach may require more minimization steps
than the direct approach, but in our case, the MPL
criterion can be updated from one evaluation to the next.
If only (some) $b_k$'s are changed, the polynomial $P_w(x)$ is
unchanged and the evaluation of the objective function is
quite fast. Using Cactus (and modular programming),
experiments like these are quite natural to do, since
vectors, objective functions and optimizers are "abstract"
objects.

Figure 2: Example of an Arizona/Cactus session.
The two plot windows dynamically update themselves
displaying the current state of minimization.
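The alternating strategy can be illustrated on a toy problem. The sketch below is not the "Wp" criterion and does not use Cactus; it minimizes a hypothetical convex quadratic in two scalar unknowns `w` and `b` by repeatedly minimizing over one with the other held fixed. The objective and the closed-form block minimizers are assumptions made purely for illustration.

```python
def objective(w, b):
    """Toy stand-in for the MPL criterion: a convex quadratic in (w, b)."""
    return (w - 1.0) ** 2 + (b + 2.0) ** 2 + 0.5 * w * b


def best_w(b):
    """Minimizer over w with b fixed: solves 2(w - 1) + 0.5 b = 0."""
    return 1.0 - 0.25 * b


def best_b(w):
    """Minimizer over b with w fixed: solves 2(b + 2) + 0.5 w = 0."""
    return -2.0 - 0.25 * w


def alternate(w, b, sweeps):
    """Alternate the two block minimizations, recording the objective."""
    history = [objective(w, b)]
    for _ in range(sweeps):
        w = best_w(b)
        b = best_b(w)
        history.append(objective(w, b))
    return w, b, history


w_hat, b_hat, trace = alternate(0.0, 0.0, sweeps=30)
```

Each sweep can only lower the objective, and here the iterates approach the joint minimizer $(1.6, -2.4)$; the recorded `trace` is the kind of sequence one would plot against iteration number when monitoring the minimization.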

A  snapshot  of  an  interactive  session  using  Arizona  is 
given  in  figure  2.  To  the  left,  we  see  a  two  part  window 
comprising  the  Lisp  command  interface.  In  the  upper 
right  window,  the  objective  function  (the  MPL  crite¬ 
rion  (1))  is  shown  in  the  current  one-dimensional  line 
search.  Below,  the  objective  function  is  plotted  against 
time (iteration number). Both plot windows dynamically
update themselves displaying the current state of
minimization.

In conclusion, our "Wp" smoothing methodology
leads to a class of interesting minimization problems.
The flexibility and interactive nature of Arizona/Cactus
make it an attractive environment for
investigating a variety of questions arising in our class
of problems.

Abstract

We describe a collection of procedures, coded in
Mathematica, for the systematic computation of asymptotic
expansions common in statistical theory and practice.
The procedures permit the expansion of maximum likelihood
estimates, the associated deviance or drop in likelihood,
and more general functions of random variables
involving one or an arbitrary number of parameters. General
expansions, written in standard notation, are produced
by these procedures and they can be evaluated for a particular
distribution through the specification of the appropriate
moment generating function. These procedures perform
complex, lengthy derivations in a fraction of the time
it takes by hand, with very little chance for error. They
permit the statistician to concentrate on the structure of a
symbolic calculation rather than on the detail of term by
term evaluation. The procedures are illustrated with
examples involving general laws.

1  Introduction 

Much of statistical theory and practice is based on
asymptotic expansions. Many programs are available to
assist in the numerical evaluation of such expansions but
there is a need for computational tools to assist in their
derivation and symbolic evaluation.

Symbolic computation is an underused facility available
to research statisticians. Packages like Mathematica,
Maple and Reduce are typically used to perform limited
tasks such as obtaining derivatives or integrals. Application
of such packages in more general problems is uncommon,
although some work does exist (Kendall 1988 and
1990, Barndorff-Nielsen and Blaesild 1986). This may be
due to the sparsity of problems that are broad enough to
merit the development of symbolic tools to solve them.
Deriving asymptotic expansions is such a problem.

The derivation of asymptotic expansions is typically a
simple, but tedious, task usually involving complicated
and laborious algebra. Consider, for example, the calculation
of the asymptotic expansion for the maximum likelihood
estimate standardized by the sandwich estimator,

$$S(\hat\theta) = \Bigl\{ -\,\partial^2 L(\theta)/\partial\theta^2 \big|_{\theta=\hat\theta} \Bigr\}^{-2} \sum_{i=1}^{n} \Bigl( \partial L_i(\theta)/\partial\theta \big|_{\theta=\hat\theta} \Bigr)^{2}.$$

$L$ is the log-likelihood function with components $L_i$. The
parameter $\theta$ is scalar and the expansion is to include the
$n^{-1}$ term. An expansion for the maximum likelihood
estimate may be accomplished by expanding the score
function in a Taylor series in $(\hat\theta - \theta)$ to the third order
and inverting it. The observed information may be
expanded in $(\hat\theta - \theta)$ to the second order and the composition
of this series with that of the maximum likelihood
estimate, while retaining terms of order $n^{-1}$, will give an
asymptotic expansion for the observed information. The
expansion for the estimate of the variance of the score,
$\sum_{i=1}^{n} \bigl( \partial L_i(\theta)/\partial\theta \big|_{\theta=\hat\theta} \bigr)^2$, may be found in a similar way.
This must then be composed with an expansion for
$(1 + x)^{-1/2}$ and then multiplied by the expansions for the
maximum likelihood estimate and the observed information.
When this is done, and terms of order $n^{-1}$ retained,
the expansion is complete. By hand, the expansion requires
several hours of algebra, checking and rechecking. When
done by computer the correct expansion is obtained in a
couple of minutes,


Lawley[AsymptoticExpansion[(thetahat-theta)*Sandwich[thetahat]^(-1/2), 2]]

In general the task of deriving asymptotic expansions,
though simple, is very large. When performed by hand,
the probability of error increases with the length of the
expression. Few statisticians would willingly take on this
challenge. Fewer would do it correctly. The computer
procedures presented here perform these expansions in a
small fraction of the time it takes by hand.

Asymptotic expansions are useful both in their general
form and in particular cases. General expansions provide
an avenue for the comparison of statistics and families of
distributions. For example, in a robustness study the above
expression could be compared to the expansion of the
maximum likelihood estimate when it is standardized by
the observed information. In particular cases, explicit
formulae, such as Edgeworth approximations, are required
for application. Such formulae are easily obtained by
evaluating general formulae through the specification of
the appropriate moment generating function.

Section  2  presents  a  summary  of  the  notation  and  pro¬ 
cedures  used.  Section  3  presents  applications  of  the  pro¬ 
cedures  to  derive  general  asymptotic  expansions.  Section 
4  contains  concluding  remarks. 

2  Notation  and  Procedures 


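We shall assume that the reader is familiar with the
summation convention and use the notation below which
is similar to that of Lawley (1956).

$$L_r = \sum_i \partial L_i/\partial\theta_r, \quad L_{rs} = \sum_i \partial^2 L_i/\partial\theta_r\partial\theta_s, \quad L_{rst} = \sum_i \partial^3 L_i/\partial\theta_r\partial\theta_s\partial\theta_t,$$

$$L_{r,s} = \sum_i (\partial L_i/\partial\theta_r)(\partial L_i/\partial\theta_s), \quad L_{r,st} = \sum_i (\partial L_i/\partial\theta_r)(\partial^2 L_i/\partial\theta_s\partial\theta_t), \quad \text{etc.},$$

$$\lambda_r = E(L_r), \quad \lambda_{rs} = E(L_{rs}), \quad \lambda_{rst} = E(L_{rst}), \quad \lambda_{r,s} = E(L_{r,s}), \quad \text{etc.},$$

$$l_{rs} = L_{rs} - \lambda_{rs}, \quad l_{r,s} = L_{r,s} - \lambda_{r,s}, \quad \text{etc.},$$

$$\lambda^{rs} = \lambda_{rs}^{-1}.$$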

The procedures that have been written to derive
asymptotic expansions produce output that is peculiar to
the symbolic package Mathematica. A procedure, called
Lawley, was written to translate output into the above
notation, thus making it readable and greatly simplifying
the preparation of this paper. Styles of other authors can
be emulated. Computer input is presented in which
subscripts are denoted as lists in braces so that they may be
entered from a keyboard. For example, $l_{rs}$ is represented
by l[{r,s}]. Greek letters are spelled out. The following
display illustrates both the input and output of the system.

Lawley[lambda[{r,s}]]


The only hand operation required to produce a typeset
version of this paper was to insert line breaks in long
equations.


2.1  Maximum  Likelihood  Estimates 


To obtain an expansion for the maximum likelihood
estimate we use an algebraic analogue of Fisher's scoring
method. Consider the algorithm based on

$$\Delta\theta_r[i] = -\lambda^{rs} L_s(\theta[i-1]) = -\lambda^{rs}\bigl\{ L_s(\theta[i-2]) - L_{st}\,\lambda^{tu} L_u(\theta[i-2]) + \cdots \bigr\},$$

$$\theta_r[i] = \theta_r[i-1] + \Delta\theta_r[i], \qquad \theta_r[0] = \theta_r. \qquad (1)$$

$\Delta\theta[i]$ is of order $n^{-i/2}$. The series converges to the
maximum likelihood estimate if it is unique.

The function MaxLikEst[i] returns the expansion of the
maximum likelihood estimate correct to order $n^{-i/2}$.

2.2  General  Functions 


The change in log-likelihood, $2[L(\hat\theta) - L(\theta)]$, may be
expanded in a Taylor series in $\Delta\theta$ and the maximum
likelihood estimate, to the correct order, substituted. This
requires a procedure which calculates expansions of functions,
whose arguments are themselves expansions, while
retaining only terms of the required order. The following
definition of the expansion of $f(g)$, where $g[i]$ is an
expansion of $g$ correct to order $n^{-i/2}$, can be used
1  g,  [order  +  I  -  i  1) 

ExpandF[f,  g,  order]  =  £(0— ^ - rj - £(/*..*..  0  + 


‘'Jr  •  gk,  [order  -i\) 

Kn-^— 1 - Uk,.k,.  01. 


.1  >.i 


-1 

where  £(/*„*,.  ■*.)  has  order  n^. 

Repeated use of this procedure allows the expansion of
such functions as the square root of the observed information.


2.3  Expected  Values 


The expected value operator is defined by the usual
basic relations:

$$E(X + Y) = E(X) + E(Y),$$

$$E(aX) = aE(X),$$

$$E(a) = a.$$

The only difficulty arises with terms involving sums of
arbitrary length. These are evaluated using

$$\mathrm{Expect}\Bigl[\prod_{k=1}^{m} \sum_{i=1}^{n} X_{k,i}\Bigr] = \sum_{s \in S} \frac{n!}{(n - s_0)!}\, \mathrm{Expect}\Bigl[\prod_{k=1}^{m} X_{k,s_k}\Bigr],$$

where $s_0$ denotes the number of distinct subscripts in $s$,
and $S$ denotes the set of ordered sets of $m$ subscripts $s$ such
that the first subscript is 1 and any following subscript is at
most 1 larger than the maximum of those preceding. The
$X_{k,i}$'s are independent of each other with respect to $i$. Here
they are derivatives of the log-likelihood, although other
applications are possible. The definition is quite general
and leads to direct algorithmic implementation. The product
of $m$ sums is evaluated with less than $m!$ terms,
independent of $n$.
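The rule for expectations of products of sums can be checked by brute force in a small case. The sketch below is written in Python rather than Mathematica and is not the authors' implementation. It assumes $X_{k,i} = g_k(Z_i)$ with $Z_1, \dots, Z_n$ independent and identically distributed on a small finite support; the functions $g_k$ and the distribution are arbitrary illustrative choices. The subscript patterns $s$ are generated exactly as described (first subscript 1, each later subscript at most one larger than the running maximum), and $n!/(n - s_0)!$ is computed as a falling factorial so that patterns with $s_0 > n$ contribute zero.

```python
from fractions import Fraction
from itertools import product
from math import prod


def subscript_patterns(m):
    """Tuples s of length m with s[0] = 1 and s[j] <= 1 + max(s[:j])."""
    def extend(prefix, top):
        if len(prefix) == m:
            yield tuple(prefix)
            return
        for value in range(1, top + 2):
            yield from extend(prefix + [value], max(top, value))
    yield from extend([1], 1)


def expect_by_patterns(g, dist, n):
    """E[prod_k sum_i g_k(Z_i)] summed over subscript patterns."""
    m = len(g)
    total = Fraction(0)
    for s in subscript_patterns(m):
        s0 = max(s)                                # number of distinct subscripts
        falling = prod(n - j for j in range(s0))   # n! / (n - s0)!
        term = Fraction(1)
        for label in range(1, s0 + 1):
            # Factors sharing a subscript share the same Z; blocks are independent.
            term *= sum(p * prod(g[k](z) for k in range(m) if s[k] == label)
                        for z, p in dist.items())
        total += falling * term
    return total


def expect_brute_force(g, dist, n):
    """Same expectation by enumerating every outcome of (Z_1, ..., Z_n)."""
    total = Fraction(0)
    for zs in product(dist, repeat=n):
        weight = prod(dist[z] for z in zs)
        total += weight * prod(sum(gk(z) for z in zs) for gk in g)
    return total


# Hypothetical example: three sums (m = 3) over n = 4 i.i.d. draws.
dist = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
g = [lambda z: z + 1, lambda z: z * z, lambda z: 2 * z - 1]
```

For $m = 3$ there are five patterns, compared with $3! = 6$, and the count does not depend on $n$.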

The above procedures are central building blocks for the
derivation of asymptotic expansions and may be used
to obtain expansions of many common statistics.

2.4  Identities 


Reduction of expressions, usually moments or cumulants,
makes use of many identities involving expected
values of the derivatives of $L$. All of these follow directly
from one basic identity:


30r 


This  is  used  to  define 


)• 


Bartlett identities are a group of identities useful in
simplifying expressions. They equate to zero linear combinations
of the expectation of derivatives of the log-likelihood
function. The $k$-th order Bartlett identity is
derived by

i. noting that $\int e^{L(\theta;\,x)}\,dx = c$,

ii. differentiating both sides with respect to $k$ (not
necessarily distinct) components of $\theta$.

The first two of these identities are well known:

$$\lambda_r = 0, \quad \text{and} \quad \lambda_{r,s} = -\lambda_{rs}.$$

Machines are useful for repetitive tasks: there are
infinitely many more of these identities.
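The first two Bartlett identities are easy to verify numerically for a single observation from a simple model. The sketch below uses a Bernoulli log-likelihood $L(p; x) = x \log p + (1 - x) \log(1 - p)$, a model chosen here only for illustration, and computes the expectations exactly over the two outcomes.

```python
def bernoulli_moments(p):
    """Return (E[L'], E[L''], E[L'^2]) for one Bernoulli(p) observation."""
    outcomes = {1: p, 0: 1.0 - p}  # value -> probability

    def d1(x):
        """First derivative of the log-likelihood in p."""
        return x / p - (1 - x) / (1.0 - p)

    def d2(x):
        """Second derivative of the log-likelihood in p."""
        return -x / p ** 2 - (1 - x) / (1.0 - p) ** 2

    e_d1 = sum(w * d1(x) for x, w in outcomes.items())
    e_d2 = sum(w * d2(x) for x, w in outcomes.items())
    e_d1_sq = sum(w * d1(x) ** 2 for x, w in outcomes.items())
    return e_d1, e_d2, e_d1_sq


e_score, e_hessian, e_score_sq = bernoulli_moments(0.3)
```

The expected score is zero, and the expected squared score equals minus the expected second derivative, both being $1/p + 1/(1 - p)$ in this model.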

3  Examples 


In the following examples we derive asymptotic
expansions for the maximum likelihood estimate, the
likelihood ratio statistic and its expected value. The
expressions are correct to order $n^{-1}$ and the examples are
displayed showing the computer input and output.

Lawley[MaxLikEst[2]]

This last expression agrees with Lawley's equation 4
except for an error in the printed version of his term
divided by 6. Our algorithm for collecting terms is not
quite efficient; a further reduction is possible. However
most of the reduction from 6* terms to 20 has been
achieved.

Evaluating expressions, like the ones above, for
specific distributions simply requires a translation of the
summation convention to an actual sum and then a term by
term evaluation of the sum. Discussion of such procedures
may be found in Andrews, Stafford, and Wang (1991).

4  Concluding  Remarks 


Symbolic computation is a useful tool that can relieve
statisticians from hours of tedious and laborious algebra.
The use of the above procedures greatly reduces the
likelihood of producing errors in expansions. In fact, their use
led to the discovery of errors in Lawley (1956) and the
printed versions of the Ph.D. dissertations by DiCiccio and
Ferguson. These procedures reproduce expressions from
Barndorff-Nielsen & Cox (1989), DiCiccio (1984),
Ferguson (1989) and McCullagh (1987) without error. Such tools
are meant to accelerate research and encourage ambitious
projects in the area of asymptotics. The development of
symbolic procedures in other areas of research is highly
recommended.

Abstract

An EM algorithm has been developed for computing the
maximum likelihood estimates and standard errors of the
accuracy rates (sensitivity and specificity) of a new
diagnostic test and an established reference test, based on
the outcomes of both tests when applied to individuals
sampled from an arbitrary number of populations with
different prevalence rates of a given disease. This
algorithm is heuristically appealing in that it also estimates the
prevalence rate within each source population and aids the
perception of the effects of numerical constraints
imposed on some of the rate parameters. An example is
given to illustrate the application of this algorithm to
practical clinical situations.

NOTE:  The  views  presented  here  are  those  of  the 
author.  No  support  or  endorsement  by  the  Food  and  Drug 
Administration  is  intended  or  should  be  inferred. 

1  Introduction 

Consider a new diagnostic test T which is to be evaluated
against an established reference test R when both tests are
used to detect a disease D in a given population of which
each individual is assumed to be either diseased (D1) or
non-diseased (D2). If the outcome from each individual is
also expressed dichotomously as positive (T1) or negative
(T2), the accuracy of T may then be assessed by its
sensitivity ($\eta$) and specificity ($\xi$), defined as $\eta$ =
Pr(T1|D1) and $\xi$ = Pr(T2|D2), respectively. These
quantities are generally referred to as the accuracy rates
of T, and their complements $\alpha = 1 - \xi$ and $\beta = 1 - \eta$, as the
error rates of T. For the case in which the accuracy rates
of both T and R are unknown, Hui and Walter [1]
employed the standard maximum likelihood method to
estimate the accuracy rates of T and R when both tests
were simultaneously applied to individuals sampled from
two populations with different prevalence rates of D. This
method, however, has been found to have too many
problems to be practical [2].

In this paper, an attempt is made to expand the method of
Hui and Walter into a more widely applicable alternative
for evaluating the performance characteristics of clinical
diagnostic tests. Specifically, the purpose is to compute the
maximum likelihood estimates (MLE's) and standard
errors (SE's) for the accuracy rates of both T and R, as well
as for the different prevalence rates of D in an arbitrary
number of populations (NB: Hui and Walter's formulas
for computing the MLE's are limited to only two
populations), presuming that R may be less than perfect in
accuracy and the disease state of each individual is not
known. Thus, instead of working directly with the
likelihood equations per se, an EM algorithm [3] has been
worked out which is easy to program and to embed with
numerical constraints selectively imposed on the rate
parameters. This approach is extremely versatile,
permitting the user to extend the applicability of the
maximum likelihood principle to the computation of
MLE's and SE's for the rate parameters in a wide variety
of cases encountered in clinical practice, such as: (1)
when both R and T have unknown accuracy rates; (2)
when R has known accuracy rates; (3) when both R and
T have a specificity equal to 1; (4) when R alone has a
specificity equal to 1; or (5) when some or all of the
source populations have known disease prevalence. For
Cases (2) through (5), in particular, the value(s) of the
known parameter(s) can be embedded as numerical
constraint(s) in the EM algorithm set up for Case (1),
thereby to yield "constrained estimates" for the remaining
parameters.
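To see how the rate parameters determine what is actually observed, the sketch below computes the probabilities of the four joint outcomes (T1 or T2, R1 or R2) in one population from its prevalence and the accuracy rates of the two tests. It assumes that T and R are conditionally independent given the true disease state, which is an assumption made for this illustration; the function name and the numeric values are hypothetical.

```python
def cell_probabilities(prevalence, eta_t, xi_t, eta_r, xi_r):
    """Return {(i, j): Pr(T_i, R_j)} for i, j in {1, 2} in one population.

    eta_* are sensitivities and xi_* specificities of tests T and R.
    Assumes T and R are conditionally independent given disease state.
    """
    beta_t, beta_r = 1.0 - eta_t, 1.0 - eta_r    # false-negative rates
    alpha_t, alpha_r = 1.0 - xi_t, 1.0 - xi_r    # false-positive rates
    diseased = {
        (1, 1): eta_t * eta_r, (1, 2): eta_t * beta_r,
        (2, 1): beta_t * eta_r, (2, 2): beta_t * beta_r,
    }
    non_diseased = {
        (1, 1): alpha_t * alpha_r, (1, 2): alpha_t * xi_r,
        (2, 1): xi_t * alpha_r, (2, 2): xi_t * xi_r,
    }
    # Mix the two disease states according to the prevalence.
    return {cell: prevalence * diseased[cell]
                  + (1.0 - prevalence) * non_diseased[cell]
            for cell in diseased}


probs = cell_probabilities(0.2, eta_t=0.9, xi_t=0.95, eta_r=0.8, xi_r=0.99)
```

Each observed cell mixes a diseased and a non-diseased component, which is why the disease state can be treated as missing data in an EM formulation.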

2  Nature  of  Problem 

The essential problem here is to evaluate the new test T
against the reference test R by comparing the MLE's of
these tests' accuracy rates (namely, $\eta_h$ and $\xi_h$, where
$0 < \eta_h < 1$, $0 < \xi_h < 1$, $h = t$ for test T or $r$ for test R,
and $1 < \eta_h + \xi_h < 2$), based on random samples of size $N_k$
drawn each from K populations with prevalence rates $\pi_k$,
k = 1, ..., K, where at least two $\pi_k$'s must be distinct from
each other. If no numerical constraints are imposed, the
parameters to be estimated may be represented by the
(K+4)-vector

(  ^*^1  *  —  > ^r)' 
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3  Nature  of  Data 


When the sample from the k-th population is subjected to
testing by T and R, the outcome data may be summarized
by the 4-vector of counts

$$y_k = (y_{k11}, y_{k12}, y_{k21}, y_{k22}),$$

where the values 1 or 2 of the second and third subscripts
denote, respectively, the outcomes T1 or T2 for test T and
R1 or R2 for test R. The data vector $y_k$, though
observable, is "incomplete" in the sense of Dempster et al.
[3]. It should be noted that each $y_{kij}$ count (i, j = 1, 2)
can be regarded as a pooling of two unobservable component
counts $x_{kij1}$ and $x_{kij2}$, where the fourth subscript
indexes the unknown disease state of the individual tested, with 1
or 2 denoting D1 or D2, respectively. Thus, in contrast to
$y_k$, the unobservable 8-vector of counts

$$x_k = (x_{k111}, x_{k112}, \ldots, x_{k221}, x_{k222})$$

is referred to as the "complete" data vector.

4 Theoretical Basis

The EM algorithm developed here is based on the idea of
Dempster et al. [3]. To fix the idea, let A and B denote,
respectively, the sample spaces of the random vectors X
and Y with the associated polynomial (multinomial) distributions
$P_X(x|\theta)$ and $P_Y(y|\theta)$. For the problem in question,
X is not directly observable but some image of X = x ∈ A can be
observed in the form Y(x) = y ∈ B, where the mapping Y: A → B
is many-to-one. Now consider finding the MLE for $\theta$
utilizing y instead of x, of which the latter is only known
to lie in a subset A(y) defined by the mapping Y: A → B. In
this context, the idea of the EM algorithm is to utilize the
fact that the likelihood of y, $L(y|\theta) = \prod_k P_Y(y_k|\theta)$,
is related to that of x, $L(x|\theta) = \prod_k P_X(x_k|\theta)$,
by the equation

$$L(y|\theta) = \sum_{x \in A(y)} L(x|\theta)$$

and to find a value $\theta^*$ of $\theta$ which maximizes
$\log L(y|\theta)$ by iteratively maximizing the expected value of
$\log L(X|\theta)$ given Y = y. Specifically, it proceeds by
introducing an initial estimate $\theta^{(0)}$ and generates a
sequence $\{\theta^{(n)}\}$ by repeating the following double step
at each iteration:

E-step: Evaluate $Q(\theta) = E[\log L(X|\theta) \mid y, \theta^{(n-1)}]$.

M-step: Find $\theta$ to maximize $Q(\theta)$.

Continue until $\{\theta^{(n)}\}$ converges.

5 Derivations for EM Procedure

The overall data, incomplete as well as complete, may be
respectively denoted by a 4K-vector

$$y = (y_1, y_2, \ldots, y_K)$$

and by an 8K-vector

$$x = (x_{1111}, x_{1112}, \ldots, x_{K221}, x_{K222}).$$

The likelihood for $\theta$ given X = x, say, is proportional to

$$L(x|\theta) = \prod_k (\pi_k \eta_t \eta_r)^{x_{k111}} (\pi_k \eta_t \beta_r)^{x_{k121}}
  \times (\pi_k \beta_t \eta_r)^{x_{k211}} (\pi_k \beta_t \beta_r)^{x_{k221}}$$
$$\qquad \times (\bar\pi_k \alpha_t \alpha_r)^{x_{k112}} (\bar\pi_k \alpha_t \xi_r)^{x_{k122}}
  \times (\bar\pi_k \xi_t \alpha_r)^{x_{k212}} (\bar\pi_k \xi_t \xi_r)^{x_{k222}},$$

where $\bar\pi_k = 1 - \pi_k$ (k = 1, ..., K), $\alpha_h = 1 - \xi_h$
and $\beta_h = 1 - \eta_h$ (h = t or r). For the parameters in
$\theta$ to be estimable it requires that K ≥ 2 and that at least
two of the $\pi_k$'s be distinct from each other. For the special
case K = 1, appropriate numerical constraints may be imposed on
some of the accuracy rates of R and/or T.

The E-step consists of setting the components of x equal
to their conditional expectations given Y = y. For k = 1, ..., K,
this yields

$$x_{k111}^{(n)} = y_{k11}\,
  \frac{\pi_k^{(n-1)} \eta_t^{(n-1)} \eta_r^{(n-1)}}
       {\pi_k^{(n-1)} \eta_t^{(n-1)} \eta_r^{(n-1)} + \bar\pi_k^{(n-1)} \alpha_t^{(n-1)} \alpha_r^{(n-1)}},
  \qquad x_{k112}^{(n)} = y_{k11} - x_{k111}^{(n)},$$
$$\vdots$$
$$x_{k221}^{(n)} = y_{k22}\,
  \frac{\pi_k^{(n-1)} \beta_t^{(n-1)} \beta_r^{(n-1)}}
       {\pi_k^{(n-1)} \beta_t^{(n-1)} \beta_r^{(n-1)} + \bar\pi_k^{(n-1)} \xi_t^{(n-1)} \xi_r^{(n-1)}},
  \qquad x_{k222}^{(n)} = y_{k22} - x_{k221}^{(n)}.$$

The M-step is executed by equating to zero the gradient
vector $(\partial/\partial\theta) \log L(x^{(n)}|\theta)$ and
solving the resulting equation for $\theta$. The solution, denoted
by $\theta^{(n)}$, is an improved estimate for $\theta$ of which
the components are expressed as follows:

$$\pi_k^{(n)} = \Big(\sum_i \sum_j x_{kij1}^{(n)}\Big) / N_k, \qquad k = 1, \ldots, K,$$
$$\eta_t^{(n)} = \Big(\sum_k \sum_j x_{k1j1}^{(n)}\Big) / N_{D1}^{(n)}, \qquad
  \eta_r^{(n)} = \Big(\sum_k \sum_i x_{ki11}^{(n)}\Big) / N_{D1}^{(n)},$$
$$\xi_t^{(n)} = \Big(\sum_k \sum_j x_{k2j2}^{(n)}\Big) / N_{D2}^{(n)}, \qquad
  \xi_r^{(n)} = \Big(\sum_k \sum_i x_{ki22}^{(n)}\Big) / N_{D2}^{(n)},$$

where $N_{D1}^{(n)} = \sum_k \sum_i \sum_j x_{kij1}^{(n)}$ and
$N_{D2}^{(n)} = \sum_k \sum_i \sum_j x_{kij2}^{(n)}$.
Here $N_{D1}^{(n)}$ and $N_{D2}^{(n)}$ are readily identified as
the estimates for the total numbers of the diseased (D1) and
the non-diseased (D2), respectively. It is also noted that

$$N_{D1}^{(n)} + N_{D2}^{(n)} = \sum_k N_k = N,$$

which is the grand total of all individuals tested.
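
The E-step and M-step described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' program: the function names (`cell_probs`, `log_lik`, `em_step`, `fit_em`), the starting values and the stopping rule are assumptions introduced here, and no numerical constraints are embedded (Case (1) only). Cell order within each population is (T1R1, T1R2, T2R1, T2R2).

```python
import math

def cell_probs(eta_t, eta_r, xi_t, xi_r):
    """Outcome probabilities for (T1R1, T1R2, T2R1, T2R2) given D1 and given D2."""
    bt, br = 1.0 - eta_t, 1.0 - eta_r   # beta_h = 1 - eta_h
    at, ar = 1.0 - xi_t, 1.0 - xi_r     # alpha_h = 1 - xi_h
    d1 = [eta_t * eta_r, eta_t * br, bt * eta_r, bt * br]
    d2 = [at * ar, at * xi_r, xi_t * ar, xi_t * xi_r]
    return d1, d2

def log_lik(y, pi, eta_t, eta_r, xi_t, xi_r):
    """Observed-data log-likelihood (up to an additive constant)."""
    d1, d2 = cell_probs(eta_t, eta_r, xi_t, xi_r)
    total = 0.0
    for yk, pk in zip(y, pi):
        for c in range(4):
            if yk[c] > 0:
                total += yk[c] * math.log(pk * d1[c] + (1.0 - pk) * d2[c])
    return total

def em_step(y, pi, eta_t, eta_r, xi_t, xi_r):
    """One E-step followed by one M-step."""
    d1, d2 = cell_probs(eta_t, eta_r, xi_t, xi_r)
    # E-step: split each observed count into expected diseased / non-diseased parts.
    x1 = []
    for yk, pk in zip(y, pi):
        row = []
        for c in range(4):
            num = pk * d1[c]
            den = num + (1.0 - pk) * d2[c]
            row.append(yk[c] * num / den if den > 0 else 0.0)
        x1.append(row)
    x2 = [[yk[c] - r[c] for c in range(4)] for yk, r in zip(y, x1)]
    # M-step: closed-form updates from the complete-data likelihood.
    n_d1 = sum(map(sum, x1))
    n_d2 = sum(map(sum, x2))
    new_pi = [sum(r) / sum(yk) for r, yk in zip(x1, y)]
    new_eta_t = sum(r[0] + r[1] for r in x1) / n_d1   # T positive among D1
    new_eta_r = sum(r[0] + r[2] for r in x1) / n_d1   # R positive among D1
    new_xi_t = sum(r[2] + r[3] for r in x2) / n_d2    # T negative among D2
    new_xi_r = sum(r[1] + r[3] for r in x2) / n_d2    # R negative among D2
    return new_pi, new_eta_t, new_eta_r, new_xi_t, new_xi_r

def fit_em(y, n_iter=2000, tol=1e-10):
    """Iterate until the log-likelihood gain falls below tol (assumed stopping rule)."""
    # Assumed starting values: T-positive fraction for prevalence, 0.8 for all rates.
    pi = [min(0.9, max(0.1, (yk[0] + yk[1]) / sum(yk))) for yk in y]
    params = (pi, 0.8, 0.8, 0.8, 0.8)
    ll = log_lik(y, *params)
    for _ in range(n_iter):
        params = em_step(y, *params)
        new_ll = log_lik(y, *params)
        if new_ll - ll < tol:
            break
        ll = new_ll
    return params
```

Starting all accuracy rates above 0.5 is intended to keep the iteration on the branch with $\eta_h + \xi_h > 1$; the mirror-image solution has the same likelihood. A known parameter could be held fixed by simply overwriting its updated value after each M-step, which is one way to read the "embedded constraint" idea.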

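6 Standard Errors & Confidence Intervals

Following Louis [4], let

$$\hat{x} = E(X \mid y, \hat\theta) = (\hat{x}_{1111}, \ldots, \hat{x}_{K222}),$$

where $\hat\theta = (\hat\pi_1, \ldots, \hat\pi_K, \hat\eta_t, \hat\eta_r, \hat\xi_t, \hat\xi_r)'$
is the MLE for $\theta$ obtained at the last EM iteration, and let
$S(x, \theta)$ and $H(x, \theta)$ denote respectively the gradient
vector of $\log L(x|\theta)$ and the negative of the associated
second derivative matrix (also known as the curvature matrix).
Then the observed information matrix for $\theta$ given the data
vector y is expressed as

$$I(\hat\theta) = \mathrm{Diag}\{E[H(X, \theta) \mid y, \hat\theta] - \mathrm{Var}[S(X, \theta) \mid y, \hat\theta]\}$$

and the diagonal elements of $I(\hat\theta)$ as

$$I(\hat\pi_k) = \frac{1}{\hat\pi_k \hat{\bar\pi}_k}
  \Big[ N_k - \frac{1}{\hat\pi_k \hat{\bar\pi}_k} \sum_i \sum_j \frac{\hat{x}_{kij1}\hat{x}_{kij2}}{y_{kij}} \Big],
  \qquad k = 1, \ldots, K,$$

$$I(\hat\eta_t) = \frac{1}{\hat\eta_t^2} \sum_k \sum_j \Big[\hat{x}_{k1j1} - \frac{\hat{x}_{k1j1}\hat{x}_{k1j2}}{y_{k1j}}\Big]
  + \frac{1}{\hat\beta_t^2} \sum_k \sum_j \Big[\hat{x}_{k2j1} - \frac{\hat{x}_{k2j1}\hat{x}_{k2j2}}{y_{k2j}}\Big],$$

$$I(\hat\eta_r) = \frac{1}{\hat\eta_r^2} \sum_k \sum_i \Big[\hat{x}_{ki11} - \frac{\hat{x}_{ki11}\hat{x}_{ki12}}{y_{ki1}}\Big]
  + \frac{1}{\hat\beta_r^2} \sum_k \sum_i \Big[\hat{x}_{ki21} - \frac{\hat{x}_{ki21}\hat{x}_{ki22}}{y_{ki2}}\Big],$$

$$I(\hat\xi_t) = \frac{1}{\hat\xi_t^2} \sum_k \sum_j \Big[\hat{x}_{k2j2} - \frac{\hat{x}_{k2j1}\hat{x}_{k2j2}}{y_{k2j}}\Big]
  + \frac{1}{\hat\alpha_t^2} \sum_k \sum_j \Big[\hat{x}_{k1j2} - \frac{\hat{x}_{k1j1}\hat{x}_{k1j2}}{y_{k1j}}\Big],$$

$$I(\hat\xi_r) = \frac{1}{\hat\xi_r^2} \sum_k \sum_i \Big[\hat{x}_{ki22} - \frac{\hat{x}_{ki21}\hat{x}_{ki22}}{y_{ki2}}\Big]
  + \frac{1}{\hat\alpha_r^2} \sum_k \sum_i \Big[\hat{x}_{ki12} - \frac{\hat{x}_{ki11}\hat{x}_{ki12}}{y_{ki1}}\Big].$$

Being diagonal, the information matrix $I(\hat\theta)$ can be
easily inverted to give an asymptotic estimate for the
variance-covariance matrix of $\hat\theta$.

Since the MLE's $\hat\pi_k$ (k = 1, ..., K), $\hat\eta_t$,
$\hat\eta_r$, $\hat\xi_t$, $\hat\xi_r$ are approximately normally
distributed with large N, confidence intervals for these estimates
may also be easily obtained. Taking, for example, the estimated
sensitivity $\hat\eta_t$ of test T, the 95% confidence interval for
its expected value $\eta_t$ may be calculated from

$$\Pr\{-1.96 < (\hat\eta_t - \eta_t)/\mathrm{SE}(\hat\eta_t) < 1.96\} = 0.95,$$

where $\mathrm{SE}(\hat\eta_t)$ is the standard error of
$\hat\eta_t$, obtained as the positive square root of
$1/I(\hat\eta_t)$, the observed variance of $\hat\eta_t$.
Standard errors and confidence intervals for the remaining
MLE's can be calculated in similar fashion.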
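
As a small worked illustration of the interval just described, the sketch below computes the normal-approximation (Wald) limits from an estimate and its standard error. The function name `wald_interval` is an assumption introduced here; the inputs used in the usage note are the rounded unconstrained estimates for R reported in Table II, so the endpoints can differ in the third decimal from those quoted in the text, which were presumably computed from unrounded standard errors.

```python
def wald_interval(estimate, se, z=1.96):
    """Two-sided normal-approximation interval: estimate +/- z * SE."""
    return estimate - z * se, estimate + z * se

def contains(interval, value):
    """True when value lies inside the closed interval."""
    lower, upper = interval
    return lower <= value <= upper
```

For example, `wald_interval(0.6254, 0.053)` and `wald_interval(0.9451, 0.022)` give the intervals for the sensitivity and specificity of R, and `contains(...)` can then be used to check whether a posited value such as 0.95 or 0.90 falls inside.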

7  Example 

The  data  in  Table  1  are  reproduced  from  Table  1  of  Gart 
and  Buck  [5].  These  data  have  been  rearranged  to  keep 
with  the  format  and  notations  adopted  in  this  paper. 

Table 1.

Outcomes of VDRL (T) and FTA (R) slide tests for syphilis
from a sample of the population of Maichew, Ethiopia
(Source: Buck & Spruyt [6], cited by Gart & Buck [5])

                        Test Outcomes
k (age grp)     T1R1    T1R2    T2R1    T2R2    Total
1 ( 5-14)          1      10       4      62       77
2 (15-24)          5       5       2      31       43
3 (25-34)         14      14       6      27       61
4 (35-44)         20      17       5      19       61
5 (45+  )         18       9       5      17       49

T1R1 = Positive to both T and R;
T1R2 = Positive to T but negative to R; etc.


Here the new diagnostic test VDRL (coded T) was to
be evaluated against the reference test FTA (coded R) for
its accuracy in detecting syphilis based on a random
sample of individuals from the town of Maichew, Tigre
Province, Ethiopia. The random sample had been stratified
by age decade beginning with age 5. Following Gart
and Buck, who posited that $\eta_r$ = 0.95 and $\xi_r$ = 0.90 on
the basis of past experience, we embedded these specified
values as constraints in the EM algorithm and estimated
the appropriate sets of parameters from each age group as
well as from all age groups combined. The results are
shown in Table II, along with those reported in Gart and
Buck. We then remove the constraints from the EM

Table II.

Parameter estimates ± SE for the data of Table 1.

Source        Age grp  pi_1        pi_2        pi_3        pi_4        pi_5        eta_t        eta_r       xi_t        xi_r
Gart & Buck*  1        .0000±.033  ---         ---         ---         ---         ---          0.95*       .8571±.040  0.90*
              2        ---         .0735±.066  ---         ---         ---         1.0000±.854  0.95*       .8919±.058  0.90*
              3        ---         ---         .2681±.071  ---         ---         .8059±.149   0.95*       .6681±.040  0.90*
              4        ---         ---         ---         .3645±.074  ---         .8621±.099   0.95*       .5400±.124  0.90*
              5        ---         ---         ---         ---         .4346±.084  .8455±.091   0.95*       .6223±.095  0.90*
   Weighted Mean       ---         ---         ---         ---         ---         .8459±.061   ---         .7675±.024  ---
EM-derived*   1        .0000±.014  ---         ---         ---         ---         1.0000       0.95*       .8572±.040  0.90*
              2        ---         .1094±.003  ---         ---         ---         .9999±.075   0.95*       .8918±.004  0.90*
              3        ---         ---         .2681±.068  ---         ---         .8058±.132   0.95*       .6680±.075  0.90*
              4        ---         ---         ---         .3645±.072  ---         .8624±.095   0.95*       .5402±.085  0.90*
              5        ---         ---         ---         ---         .4346±.081  .8453±.098   0.95*       .6752±.096  0.90*
              All      .0000±.016  .1027±.056  .2633±.066  .3859±.073  .4358±.082  .8792±.058   0.95*       .7539±.030  0.90*
EM-der#       All      .0110±.027  .1823±.075  .4953±.083  .6779±.078  .6476±.086  .8091±.048   .6254±.053  .8758±.031  .9451±.022

* Constrained with eta_r = 0.95 and xi_r = 0.90
# Unconstrained

iterative procedure, thereby to compute the unconstrained
estimates using data from all age groups. The results are
given at the bottom of Table II. As can be seen from
Table II, the EM-derived, constrained estimates are all
closely comparable to those estimates obtained by Gart
and Buck. In addition, the constrained EM estimates
computed from the combined data of all age groups are
also seen to be quite comparable to the weighted means
of Gart and Buck. Specifically, the sensitivity and
specificity of T are given as $\hat\eta_t$ = 0.8792 ± 0.058 and
$\hat\xi_t$ = 0.7539 ± 0.030 compared to Gart and Buck's weighted
means $\hat\eta_t$ = 0.8459 ± 0.061 and $\hat\xi_t$ = 0.7675 ± 0.024,
respectively. The question now arises as to how reliable
the constrained estimates are. Let us address this question
by comparing them with the unconstrained EM estimates.
First of all, let us construct a 95% confidence interval for
the sensitivity and specificity of R utilizing the standard
errors associated with the unconstrained estimates
$\hat\eta_r$ = 0.6254 and $\hat\xi_r$ = 0.9451, yielding
(0.5224, 0.7285) and (0.9016, 0.9885) for $\eta_r$ and $\xi_r$,
respectively. It is no small surprise to find that neither of the
specified values $\eta_r$ = 0.95 and $\xi_r$ = 0.90 is contained
in the corresponding confidence interval. We are thus led to infer
that the specified values did not fit the data well and that the
unconstrained estimates for the sensitivity and specificity of T,
namely, $\hat\eta_t$ = 0.8091 ± 0.048 and $\hat\xi_t$ = 0.8758 ± 0.031,
would be preferable to the constrained ones. It also follows that
most of the constrained estimates for the prevalence rates
of syphilis in different age groups (or subpopulations) may
have been underestimated in the light of their counterparts
obtained by the unconstrained EM procedure.

1 Introduction

For data which take the form of a two-way contingency
table, many authors have examined models other than
the common model of independence of two classifications.
In the literature are quasi independence models,
clustered sampling models, intraclass models, and the
Bradley-Terry model, among others. In order to estimate
model parameters, iterative methods are required,
and this leads to the problem of developing efficient
algorithms.

2  Quasi  Independence  Models 

If there are cells in a contingency table which a priori
have a zero count, and there are cells in that row or
column which have non-zero counts, the independence
model is not appropriate. To cope with this problem of
so-called structural zeroes, the notion of quasi independence
is useful. In a quasi independence model the row
and column classifications are independent, provided the
cells with a priori zero counts are ignored.

Examples given in the literature of quasi independence
models include the random pairing models of
Larntz and Weisberg (1976) and de Jong, Greig and
Madan (1983), the mover-stayer model as discussed by
Morgan and Titterington (1977) and the model of Lemon
and Chatfield (1971), which is an alternative to a Markov
chain model. Larntz and Weisberg's model can be obtained
from Lemon and Chatfield's by folding along the
main diagonal.

These models are all for data which take the form
of a contingency table with entries on the main diagonal
which are zero a priori, with the exception of Larntz
and Weisberg's model, where the entries on or below the
main diagonal are all zero a priori.

In order to fit these models to data, various methods
have been proposed. Iterative proportional fitting
(IPF) is commonly used. IPF requires that the model
be log linear, which is not the case for de Jong, Greig
and Madan's model. The Newton-Raphson method converges
quickly, but it is not easy to implement if the
Hessian is not diagonal. The Hessian is diagonal for the
mover-stayer model and de Jong, Greig and Madan's
random pairing model, but not for the other models.
Fixed point iterations have been used by a number of
authors, but these can be slow to converge. Brown
(1974) developed a method for dealing with a priori zeroes,
which iterates over the cells which are known to
be zero. Brown's method becomes less and less efficient
with an increasing number of zero cells. de Jong, Greig
and Madan (1983) developed a method for fitting their
random pairing model which involves a reparameterisation,
then fixed point iteration. This method can also
be adapted to fit other quasi independence models.
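
For readers unfamiliar with iterative proportional fitting under structural zeroes, the following sketch shows the standard textbook procedure: fitted values start at one in every permitted cell and zero in every structurally empty cell, and rows and columns are rescaled in turn until the fitted margins match the observed ones. It illustrates the commonly used IPF approach mentioned above, not the single-equation method developed by the authors; the function name and arguments are assumptions made for this example.

```python
def fit_quasi_independence(table, structural_zero, n_iter=1000, tol=1e-12):
    """IPF for a quasi independence model.

    table            observed counts, list of rows
    structural_zero  same shape, True where a cell is zero a priori
    Returns the fitted expected counts (zero in the structural-zero cells).
    """
    n_rows, n_cols = len(table), len(table[0])
    allowed = [[not structural_zero[i][j] for j in range(n_cols)] for i in range(n_rows)]
    # Observed margins over the permitted cells only.
    row_tot = [sum(table[i][j] for j in range(n_cols) if allowed[i][j]) for i in range(n_rows)]
    col_tot = [sum(table[i][j] for i in range(n_rows) if allowed[i][j]) for j in range(n_cols)]
    fit = [[1.0 if allowed[i][j] else 0.0 for j in range(n_cols)] for i in range(n_rows)]
    for _ in range(n_iter):
        # Rescale each row to its observed total.
        for i in range(n_rows):
            s = sum(fit[i])
            if s > 0:
                fit[i] = [v * row_tot[i] / s for v in fit[i]]
        # Rescale each column to its observed total.
        for j in range(n_cols):
            s = sum(fit[i][j] for i in range(n_rows))
            if s > 0:
                for i in range(n_rows):
                    fit[i][j] *= col_tot[j] / s
        # Columns now match exactly; stop once the rows match too.
        if max(abs(sum(fit[i]) - row_tot[i]) for i in range(n_rows)) < tol:
            break
    return fit
```

Because every update multiplies a whole row or column by a constant, the fitted values keep the product form (row effect) × (column effect) on the permitted cells, which is exactly the quasi independence structure.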

It can be shown that when the table is symmetric,
the parameter estimates obtained for Lemon and
Chatfield's (1971) model are identical with those obtained
from fitting the mover-stayer model. This means that
the models of both Lemon and Chatfield, and Larntz
and Weisberg may be fitted by symmetrising the data
and fitting the mover-stayer model. Thus a readily
implementable Newton-Raphson approach is available for
these models.

All  the  methods  so  far  discussed  estimate  a  proba¬ 
bility  distribution  and  involve  a  multivariable  iteration. 
The  authors  have  developed  new  methods  for  fitting 
quasi  independence  models  which  require  the  solution  of 
a  nonlinear  equation  in  a  single  unknown.  This  equation 
is  readily  solved  using  Newton’s  method.  This  gives  fast, 
very  easily-implemented  methods.  No  programming  is 
required,  and  well-known  packages  such  as  Minitab  or  a 
spreadsheet  may  be  used  to  do  the  calculations. 

3  Other  Models 

In  the  clustered  sampling  model  with  clusters  of  size  two, 
if  a  number  of  clusters  are  observed,  and  each  member 
of  the  cluster  is  classified  according  to  some  character¬ 
istic,  the  data  take  the  form  of  a  two-way  contingency 
table.  Cohen  (1976)  has  given  a  method  which  requires 
a  two-stage  iteration  procedure,  with  one  stage  being  the 
solution  of  a  non-linear  equation.  Using  a  reparameteri- 
sation  the  authors  were  able  to  reduce  the  computations 
required.  A  two-stage  iteration  is  still  required,  but  only 
simple  expressions  need  to  be  evaluated  at  any  stage. 
This  work  is  reported  in  Scott  and  Wang  (1990). 

Attempts were made to develop improved methods
for intraclass models (see Haber (1982)) and the
Bradley-Terry model, without success.

Abstract

This paper is concerned with the open-market transactions
of corporate insiders. The Securities and Exchange
Commission (SEC) publishes information on the buying
and selling activities of insiders, which market analysts
use to uncover insider sentiment about the prospects of
their own corporations and of the entire market. However,
the law requires that these transactions be reported
to the SEC by the tenth day of the month following
a transaction. Moreover, many insiders do not comply
with this regulation. Therefore, the available data are
always out-of-date in a random manner. We also found
that the time lags in reporting buy and sell transactions
are different. Since the available data are truncated, it
was necessary to adjust the observed probability density
function (pdf). We used distributed lag models to study
their out-of-sample forecasting performance.

1.  Introduction 

Important officers of a corporation, called "insiders,"
trade on the stock market in the securities of the firms
they work for according to their own hunches about the
company, the overall market, personal financial needs
and other circumstances. Market analysts abstract from
the personal behavior of insiders and the prospects of
the companies they work for by aggregating the data on
trading by these investors. Stock market analysts routinely
use the insider trading data as indicators of major
trends and turning points in the market. Insider trading
has also been the subject of many academic studies.
The overwhelming majority of these papers support
the notion that insider trading provides valuable
long-term information. For instance, Finnerty (1976)
showed the usefulness of insider trading information for
the purpose of selecting individual securities. The prices
of stocks purchased by insiders tend to appreciate faster
than the stock market, while securities sold by these
investors tend to do worse than the overall market. A
study by Seyhun (1988), in turn, showed that insider
trading provides advance signals about stock market
movements. According to Seyhun, insiders increase
purchases before stock market rallies and increase sales
before stock market corrections.

However, there is a significant time lag between actual
insider transactions and their full reporting by the
SEC. Also, the inflow of reports is subject to ups and
downs due, for instance, to deadline effects on reporting,
etc. Moreover, our study of time lags in reporting
showed different distributions for buy and for sell
transactions. For instance, the mean time lag between the
actual transaction date and the filing date with the SEC
was only 30.3 days for sales and as much as 32.6 days
for purchases.

Many of the above factors are subjective in nature
and do not exhibit steady patterns. This makes it very
difficult to capture fully their effects. For example, an
attempt to explain the insider behavior with the help of
Rao's (1965) weighted distributions, as in Vinod (1991),
has not yielded useful patterns.

Instead of focusing attention on explaining insider behavior,
in this paper we study the information content
in insider activity and its effect on market agents. We
consider the problem of forecasting Standard and Poor's
500 Index with the help of insider trading data. Thus,
we make insider transactions a part of our conditioning
set. However, the reporting lag and different lag
distributions for purchase and sale transactions mean that
the forecast of major turning points and trends based
directly on the initial and incomplete SEC insider trading
data may be misleading. Therefore, this paper attempts
to answer the following questions: (i) Are the available
insider trading data worth using? (ii) How should we
distill useful information regarding short-term market
trends from insider data? (iii) How do we evaluate the
information objectively?

2.  Data  Analysis 

We studied the phenomenon from January 6 through
May 29 of 1987, or during 101 trading days (N). We
took the insider trading data from a computer tape provided
by the SEC. The tape showed that corporate officers
executed 22,509 open-market transactions over this
period. The data on insider trading over these 101 business
days are separated into purchase (PUR) and sale
(SALE) transactions. Thus, we visualize two massive
matrices with rows representing dates when transactions
occurred, and columns showing dates when transactions
were reported. So, pur(i, j) and sale(i, j) indicate the
number of transactions executed on day i and reported
on day j. Note, however, that because of the reporting
lag, pur(i, j) = 0 and sale(i, j) = 0 for all j < i + MINLAG,
where MINLAG is the minimum time lag between the
transaction and the arrival of the insider report at the
SEC (at least one day).

Now define cumulative sum matrices

$$\mathrm{CUMPUR}(i, j) = \sum_{t=1}^{j} \mathrm{pur}(i, t)$$

$$\mathrm{CUMSALE}(i, j) = \sum_{t=1}^{j} \mathrm{sale}(i, t)$$

for purchases and sales, respectively, giving the cumulative
number of buy and sale transactions for day i as
known on day j. For instance, the elements along the
principal diagonals of the matrices (j = i) show the
information on insider buying or selling activity as known on
the same day.

At any point, one can obtain data on these initial
cumulative daily numbers with a time lag (L) represented
by (j - i). The greater the L, the more accurate the data
we can read from these matrices. However, our objective
is to obtain a good approximation of the true insider
activity as early as possible, because only the information
that is not widely distributed among investors really
matters in the marketplace. Therefore, we seek an
expansion of the early data which would allow us to predict
the actual cumulative amounts of purchases and sales as
they are eventually reported. We propose smoothed
reciprocals of smoothed lag distributions for purchases and
sales as our expansion factors.

3.  Smoothed  Expansion  Factors 

From a study of the 22,509 insider transactions we
constructed two vectors, separate for purchases and sales,
showing the number of transactions reported to the SEC
with a certain time lag in days. For example, of the
8,192 buy transactions, only two reports came within
one day of the actual transaction date. Similarly, out
of the 14,317 sale transactions, only one was reported
within one day. Most filing (80 percent) is done with
a lag of fifteen days or more. The data on the number
of purchases and on the number of sales reported
with a certain time lag are denoted by PUR(LAG) and
SALE(LAG), respectively, where LAG = 1, 2, ..., N. We
obtain separate "unadjusted" lag distributions for purchases
and sales, denoted by UP(LAG) and US(LAG),
by dividing PUR(LAG) and SALE(LAG) by the total
number of purchases and sales, respectively. These
aggregated data are found to be representative of the
fundamental lag structure.

In principle there is a separate lag distribution for
each row of data, and it is not a reliable guide for any
other day's lag distribution. The flow of data is erratic,
however, and, for that reason, our lag distributions
UP(LAG) and US(LAG) are not sufficiently smooth to
be a useful guide to the distribution of an arbitrary
lag. To overcome this difficulty we use Tukey's (1977)
"smoother" called 3RSSH. Here, 3R stands for the moving
median of three consecutive values repeated three
times, which is followed by an "end correction," SS
stands for "split-smooth" applied twice, and, finally,
H stands for "hanning," or a weighted average of three
consecutive values with weights 0.25, 0.50 and 0.25.
We denote by AP(LAG) and AS(LAG) these "adjusted"
smoothed UP(LAG) and US(LAG), respectively. We
use the reciprocals of AP(LAG) and AS(LAG), which
are smoothed again, as our expansion factors. The
expansion factors are positive and decline to one as
we get more complete data on insider activity, i.e.,

$$\lim_{\mathrm{LAG} \to \infty} \mathrm{AP(LAG)} = 1 \quad \text{and} \quad \lim_{\mathrm{LAG} \to \infty} \mathrm{AS(LAG)} = 1.$$

Applying the expansion we get

$$P(i, j) = \mathrm{CUMPUR}(i, j) / \mathrm{AP(LAG)}$$

$$S(i, j) = \mathrm{CUMSALE}(i, j) / \mathrm{AS(LAG)}$$

where LAG = j - i + 1 for j ≥ i.

Finally, we calculate the adjusted purchase-to-sale ratios
(PSR's) for each day:

$$\mathrm{PSR}(i, j) = P(i, j) / S(i, j) \qquad (1)$$

Statistical properties of PSR's may be studied by
bootstrapping, as surveyed in Vinod (1992).
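
The expansion can be illustrated with a short sketch. It rests on several assumptions that are not taken from the paper: days are indexed from zero, the quantity dividing the cumulative count is read as the cumulative share of transactions reported within LAG days (which tends to one, as in the limits above), and the 3RSSH smoothing step is omitted entirely. The function names and the toy numbers in the usage note are hypothetical.

```python
def cumulative_known(reports, i, j):
    """Transactions executed on day i that have been reported by day j.

    reports[i][t] is the number executed on day i and reported on day t.
    """
    return sum(reports[i][t] for t in range(i, j + 1))

def reported_share(lag_counts):
    """Cumulative share reported within LAG = 1, 2, ... days (unsmoothed)."""
    total = sum(lag_counts)
    shares, running = [], 0
    for count in lag_counts:
        running += count
        shares.append(running / total)
    return shares

def expand(cumulative, share):
    """Scale an early cumulative count up to its expected final value."""
    return cumulative / share if share > 0 else 0.0

def psr(purchases, sales):
    """Purchase-to-sale ratio, set to 0 when no sales are known yet."""
    return purchases / sales if sales > 0 else 0.0

def adjusted_psr(pur, sale, pur_share, sale_share, i, j):
    """Expanded PSR for transaction day i as seen on day j (LAG = j - i + 1)."""
    lag_index = j - i          # zero-based position of LAG in the share vectors
    p = expand(cumulative_known(pur, i, j), pur_share[lag_index])
    s = expand(cumulative_known(sale, i, j), sale_share[lag_index])
    return psr(p, s)
```

With toy inputs `pur = [[0, 2, 3, 5]]` and `sale = [[0, 4, 4, 12]]`, and shares built from the same rows, `adjusted_psr(pur, sale, reported_share(pur[0]), reported_share(sale[0]), 0, 2)` recovers the final ratio of 10 purchases to 20 sales from data available on the third day.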

The following section discusses an econometric application
of the above procedure to the study of insider
trading, where the data on purchase-to-sales ratios of
insider transactions are necessarily truncated because of
the lag between transaction dates and their reporting to
the SEC.

4.  An  Application  of  the  Expansion  to 
Stock  Market  Forecasting 

In  order  to  answer  the  questions  that  the  introduction 
poses,  we  employ  a  model  combining  insider  trading  data 
with  the  level  of  interest  rates,  represented  here  by  the 
end-of-the-day  yield  on  the  30-year  Treasury  Bond. 

Of  the  22,509  insider  open-market  transactions  exe¬ 
cuted  over  the  period  of  January  6  through  May  29,  1987, 
only  3,590  buys  and  6,887  sales  were  reported  to  the  SEC 
by  the  last  day  of  the  above  time  range.  This  particular 
subset  is  the  basis  for  our  calculations  of  PSR’s  and  for 
the  Ordinary  Least  Squares  (OLS)  estimation. 

The  reader  should  note  that  the  above  structure  of 
transactions  is  consistent  with  the  overall  pattern  of  in¬ 
sider  trading  as  reported  by  various  stock  market  fore¬ 
casters.  They  agree  that  on  average  there  are  two  sales 
for  each  purchase.  The  disparity  between  the  number 
of  purchases  and  sales  stems  from  the  fact  that  insiders 
can  obtain  shares  of  their  companies  through  non-open- 
market  transactions,  such  as  various  incentive  plans, 
pension  plans,  etc. 

The  PSR’s  are  calculated  on  a  daily  basis.  However, 
the  data  on  the  most  recent  insider  activity  is  very  lim¬ 
ited,  and  it  is  normal  for  the  reported  number  of  total 
sales  or  buys  to  be  zero.  In  the  former  case  the  PSR's 
are  not  defined,  in  the  latter,  the  ratios  are  equal  to  zero. 
Nevertheless,  in  both  instances,  we  set  such  PSR’s  to  0. 

A  researcher  faces  a  dilemma  here;  either  to  wait  for 
more  data  and  to  deal  with  more  reliable  numbers,  or  to 
make  early  predictions  and  to  risk  significant  errors.  On 
the  basis  of  empirical  tests  we  choose  the  lag  of  fifteen 
days  to  be  an  optimal  one.  For  the  15-day  lag  we  have 
enough  data  to  construct  a  sufficient  number  of  PSR’s, 
and  at  the  same  time  we  are  close  enough  to  actual  in¬ 
sider  activity  to  be  able  to  use  this  information  to  our 
advantage.  We  propose  a  new  technique  for  adjusting 
for  the  time  lag  between  the  transaction  date  and  the 
report  date. 

We propose the following so-called rational distributed
lags model (see Judge et al., 1985) for forecasting the
S&P 500 index denoted by $s_t$:

$$s_t = \beta_0 + \beta_1 s_{t-1} + \beta_2 y_t + \beta_3 p_t + \epsilon_t \qquad (2)$$

where $y_t$ is the end-of-the-day yield on the 30-year Treasury
Bond, $p_t$ is the PSR and $\epsilon_t$ is the error term.
Writing (2) in terms of the lag operator L, we have

$$(1 - \beta_1 L) s_t = \beta_0 + \beta_2 y_t + \beta_3 p_t + \epsilon_t \qquad (3)$$

Note that, dividing both sides by $(1 - \beta_1 L)$, (2) represents
an infinite order lag structure with exponentially
declining weights provided $|\beta_1| < 1$.
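
The step from the lag-operator form to the infinite lag structure can be written out explicitly. The derivation below is a standard expansion added for clarity; it uses only the symbols of the model above and the geometric-series inverse of the lag polynomial, which is valid under the stated condition on the coefficient of the lagged index.

```latex
% Inverse of the lag polynomial for |\beta_1| < 1:
(1 - \beta_1 L)^{-1} = \sum_{j=0}^{\infty} \beta_1^{j} L^{j}

% Applying it to both sides of (1 - \beta_1 L) s_t = \beta_0 + \beta_2 y_t + \beta_3 p_t + \epsilon_t:
s_t = \frac{\beta_0}{1 - \beta_1}
      + \sum_{j=0}^{\infty} \beta_1^{j}
        \left( \beta_2\, y_{t-j} + \beta_3\, p_{t-j} + \epsilon_{t-j} \right)
```

The weight on a value observed j days earlier is $\beta_1^{j}$ times its contemporaneous coefficient, so the influence of past yields and PSR's dies out geometrically.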

Tables  1  and  2  report  our  Ordinary  Least  Squares  es¬ 
timates  of  (2)  when  the  data  points  having  zero  PSR’s 
are  omitted.  Equation  I  refers  to  the  OLS  estimates  for 
the  expanded  PSR’s,  equation  II  has  unadjusted  PSR’s, 
and  equation  III  has  PSR’s  omitted.  Note  that  the  last 
equation  uses  the  same  input  data  matrix  as  the  first 
two. 


Table 1

Eq.   Coef.     Estimate   St. Error   t-stat.
I     beta_0     -14.39      32.05      -0.45
      beta_1       0.92       0.04      20.51
      beta_2       4.79       5.28       0.91
      beta_3       2.10       1.18       1.79
II    beta_0      -5.63      34.06      -0.17
      beta_1       0.91       0.52      17.35
      beta_2       4.31       5.62       0.77
      beta_3      -0.80       1.61      -0.50
III   beta_0      -6.95      32.58      -0.21
      beta_1       0.91       0.05      19.94
      beta_2       4.25       5.41       0.79

Table 2

Eq.             F-val     SFE    SDFE     MFE     H-M
I      0.952    299.0   13.20    2.19    4.70    0.78
II     0.950    425.4   14.82    2.30    5.11    0.56
III    0.936    208.9   13.72    2.28    4.61    0.78

The results in Table 1 clearly show the superiority
of the model containing adjusted PSR's, although $\beta_3$
in eq. I is statistically significant only at the 10 percent
level, when we compare the t-statistic with the t
tables. It should be noted, however, that the t-statistic
for adjusted PSR's is better than that of $\beta_2$, the yield on
Treasury Bonds. From the t-statistics alone, one may be
tempted to omit the Treasury Bond yield variable. However,
its omission leads to worse overall out-of-sample
forecasts. That the interest rate has an important effect
on stock prices is also well known in the financial
economics literature (see Lorie et al., 1985).

The adjusted PSR's, as expected, vary directly with
the S&P 500 index, while the unadjusted ratios show
an inverse relationship to the stock market. The former
indicates that insiders correctly anticipate changes in the
market direction, and therefore, investors can learn from
them. The latter, contrary to the popular view, would
result in losses to insiders and those who follow in their
footsteps.

The abbreviations SFE, SDFE, MFE, and H-M in Table 2 stand for the sum of out-of-sample forecast errors, the standard deviation of out-of-sample forecast errors, the maximum of out-of-sample forecast errors, and the Henriksson-Merton (1981) test, explained below, respectively.

The Henriksson-Merton statistic indicates the results of a nonparametric (distribution-free) test of timing. The test is based on the direction of the predicted movement, not the magnitude. It is not sensitive to the distribution of stock prices, it does not assume symmetry in the ability to make "up" forecasts and "down" forecasts, and it allows for nonstationarities. The null hypothesis is that the forecaster has no skill and forecasts randomly. Even so, such a forecaster will sometimes make correct predictions by chance.

Let N_1 and N_2 be the numbers of down and up observations, respectively. Thus, the total number of observations is N = N_1 + N_2. Let n_1 denote the number of correct down predictions, which must lie in the range


\underline{n}_1 = \max(0, n - N_2) \le n_1 \le \min(N_1, n) = \overline{n}_1


It  has  been  shown  that  ni  has  a  hypergeometric  distri¬ 
bution 


where m is the number of forecasts made, and r = n_1 is
the number of correct forecasts. It is argued that the one
tail  test  is  relevant  here.  One  rejects  the  null  hypothesis 
if  the  number  of  correct  forecasts  exceeds  a  number  ^*(c) 
based  on 


The test gives a confidence score on a scale of 0 to 1, with a high score for procedures that predict the direction most accurately. Our tabulated results suggest a confidence score of 0.78 for the adjusted PSR's. It is based on the nine out-of-sample forecasts. By contrast, for unadjusted data the confidence score is only 0.55. Hence, the H-M test supports the usefulness of the expansion.
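
As an illustration of the hypergeometric null distribution described above, the following is a minimal Python sketch. It assumes the usual Henriksson-Merton setup, in which n is the number of "down" forecasts issued and the confidence score is taken to be the null probability of observing fewer correct down forecasts than were actually achieved; the function names and the numbers in the usage note are hypothetical and are not the paper's data.

```python
from math import comb


def hm_pmf(x, n1_total, n2_total, n_down_forecasts):
    """Null probability that exactly x of the 'down' forecasts are correct.

    n1_total: number of down observations (N_1)
    n2_total: number of up observations (N_2)
    n_down_forecasts: number of 'down' forecasts issued (n)
    """
    total = n1_total + n2_total
    return (comb(n1_total, x) * comb(n2_total, n_down_forecasts - x)
            / comb(total, n_down_forecasts))


def hm_confidence(n1_correct, n1_total, n2_total, n_down_forecasts):
    """One-tail confidence: P(fewer than n1_correct correct calls | no skill)."""
    lowest = max(0, n_down_forecasts - n2_total)  # lower end of feasible range
    return sum(hm_pmf(x, n1_total, n2_total, n_down_forecasts)
               for x in range(lowest, n1_correct))
```

For example, with 4 down and 5 up observations and 4 down forecasts of which 3 are correct, `hm_confidence(3, 4, 5, 4)` evaluates to 105/126, roughly 0.83.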

SFE, SDFE, and MFE statistics are smaller for the equations containing the expanded ratios than those for the model having unadjusted PSR's. Similarly, the comparison of these statistics for the equations II and III, except for MFE, is favorable for the model with adjusted PSR's. Thus, our study generally shows that the inclusion of adjusted PSR's improves the stock market forecast.


Conclusions 

Our  paper  illustrates  the  benefits  one  can  obtain  from 
the  application  of  smoothing  techniques  developed  in 
the  context  of  robust  statistical  estimation.  Expanded 
PSR’s  (1)  not  only  offer  a  better  short-run  forecast,  but 
they give researchers the correct picture of insider sentiment. On the other hand, the unadjusted data show no statistically significant relationship with the stock market. Therefore, we posit that adjusted PSR's should be included in models forecasting the stock market in the near-term.

Abstract

In a study of two sensors polling data on emitting targets, one sensor may observe a target while the other fails to; even if both sensors observe a target, random noise distorts the picture. In this paper a general algorithm is developed for detecting the pairs of observations made by the sensors on the same targets and for fusing each pair into a single target.

1  Introduction 

MICOM/TACOM initiated the Setter program in 1982, in which three sensors, namely radio frequency interferometer, non-imaging and radar, poll emitting targets, detect potential threats in an air-land battle scenario, and provide the operator/gunner a synergistic effect through a microprocessor-based data management system and an integrated display with enhanced real-time integrated information. The sensors vary in their ability to detect different types of targets and under different terrain and weather conditions. The complexity of the problem arises when the different sensors detect the same targets with varying noise levels, or when one sensor detects a target while the others fail to detect it. The first task is to determine the number and positions of targets from the data collected independently by multiple sensors; this should take into consideration the fact that, in the presence of random noise accompanying the data, a single target may be distorted by the sensors into multiple targets and vice versa. Once the resolution of targets is accomplished by the processor, the task of determining their nature and priority will follow.

In section 2 we define the problem and set forth some criteria and their rationale for fusion of the data. In section 3 we present a general discussion of the problem which leads to the development of an algorithm given in section 4. This algorithm finds the number and locations of targets from the data received from two independent sensors for identification of targets. Finally, in section 5 we discuss the lines of research to be followed in the future.

*Research supported by US MICOM DAAH01-82-D-A008 while at the University of Alabama in Huntsville.

2  Criteria 

Let X_i, i = 1, ..., n be observations detected by a sensor, and Y_j, j = 1, ..., m be observations by another sensor. We assume X's and Y's are normally and independently distributed with true positions of the targets as their means and known standard deviations σ_1 and σ_2 respectively. These observations may indicate that there are at least the maximum of m and n targets, but not more than m+n targets. Once a pair (X_i, Y_j) is isolated for fusion and determined to be a single target detected by both sensors, an efficient estimator is given by the weighted mean

(\sigma_2^2 X_i + \sigma_1^2 Y_j) / (\sigma_1^2 + \sigma_2^2).    (2.1)

We present the following criteria for possible fusion of a pair (X_i, Y_j):

1. Compatibility X_i and Y_j are said to be compatible or matchable if

|X_i - Y_j| < z_\alpha \sqrt{\sigma_1^2 + \sigma_2^2}    (2.2)

with preassigned probability 1 − α, where z_α is the value of the standardized normal variate corresponding to the tail probability α/2. Otherwise, they are said to be incompatible and represent 2 targets.

2. Monogamy Given Y_j, there is at most one X_i that can be fused with Y_j.

3.  Maximum  number  of  fusions  The  number  of  fusions 
of  X’s  and  Y’s  subject  to  the  compatibility  and 
monogamy  criteria  is  maximized. 

4. Least square error (LS) The optimality condition for selecting one of several X's compatible with a given Y_j and one of several Y's compatible with a given X_i is that

\sum (X_i - Y_j)^2    (2.3)

is minimized, where the summation is taken over all possible sets of pairs (X_i, Y_j) that satisfy the preceding criteria. This is simply a Euclidean distance. If we use a different metric for least square error, we get an entirely different set of fusions.

The error probability α can be so chosen that the overall error probability of misclassifying some single targets as multiple ones, being bounded by a multiple of α, meets a certain threshold. The monogamy criterion rules out the possibility of a sensor seeing an object as two images. This excludes from our present analysis the study of a possible malfunction in which a sensor sees double. The criterion of making the maximum number of fusions reduces the probability α of viewing a single target as multiple targets based on two sensors. Finally, the optimality condition is the same as the well-known least square error principle in statistics.

3. Discussion

Without loss of generality, X's and Y's are sorted in a nondecreasing order. They are all plotted on a real line, and with each Y_j we associate all X's that satisfy (2.2). For illustration, let Y_1 be compatible with X_1, X_2 and X_3, Y_2 with X_2 and X_3, Y_3 with X_3, X_4 and X_5, and so on. This information can be presented in an n x m table like Figure 3.1 given below. Compatibility of (X_i, Y_j) is noted in Figure 3.1 by a symbol such as ♦, +, •, ×, and □.

Observe that X_7 is not compatible with any Y, i.e., X_7 is observed only by the first sensor; similarly, Yjg is observed only by the second sensor. Xg is the only one compatible with Yg, so we can fuse them into a single target. On the other hand, X_5 is compatible with Y_3 and Y_4. Since Y_3 has two other compatible X's, X_5 should be fused with Y_4 in spite of the fact that X_5 may be closer to Y_3 than Y_4; otherwise, Y_4 will have no matches, thus violating the third criterion of achieving the maximum number of fusions. Now that X_5 is fused with Y_4, X_4 is automatically fused with Y_3. In the case of Xg, it has 3 possible Y's for matching. Obviously the Y closest to Xg will be chosen for fusion.

After we have matched the X's and Y's as in the discussion above, we recursively repeat this obvious type of matching, dropping the matched or fused pairs as well as those X's and Y's that cannot be matched, until we arrive at contiguous configurations in which each Y has at least two compatible X's and certain geometrical properties hold.

Figure 3.2 given below is a good example of a table
pruned for obvious matches and nonmatches. Let us consider the portion of the table with 7 rows and 4 columns giving rise to the first contiguous block. Since there are 3 compatible X's for Y_1, 2 for Y_2, 3 for Y_3 and 3 for Y_4, we have 3x2x3x3 = 54 possible sets of 4 pairs. However, by the monogamy criterion this number is reduced to 19, which can be enumerated as

2345, 2346, 2347, 2356, 2357, 2365, 2367, 2456,

2457, 2467, 2465, 3456, 3465, 3457, 3467, 4356,

4357, 4365, 4367

where, e.g., 2345 denotes the set of pairs (X_2, Y_1), (X_3, Y_2), (X_4, Y_3) and (X_5, Y_4). It can be shown that the LS criterion implies that X_i < X_j < X_k < X_l, since their counterparts satisfy Y_1 < Y_2 < Y_3 < Y_4. We call this principle the seniority protocol, that is, if the Y's are sorted in increasing order, then their corresponding X's are also sorted in increasing order. By this principle we can eliminate 7 of the 19 sets, e.g., 4365. Among the remaining 12 sets, only two contain a maximum number of X's that are nearest to their Y's so that the least square error will be smaller. They are 2346 and 3456, for which we need to compute the LS error and choose the one with the smaller value.
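
The counts quoted above (54, 19 and 12) can be checked with a short enumeration. The sketch below is illustrative only: the compatibility lists are an assumption inferred from the 19 enumerated sets (X_2, X_3, X_4 for Y_1; X_3, X_4 for Y_2; X_4, X_5, X_6 for Y_3; X_5, X_6, X_7 for Y_4), since Figure 3.2 itself is not reproduced here.

```python
from itertools import product

# Assumed subscripts of the X's compatible with Y_1..Y_4 (inferred from
# the enumerated sets in the text, not read from Figure 3.2).
compatible = {1: (2, 3, 4), 2: (3, 4), 3: (4, 5, 6), 4: (5, 6, 7)}

# Every way of picking one compatible X for each Y.
all_sets = list(product(*(compatible[j] for j in sorted(compatible))))

# Monogamy: no X may be fused with two different Y's.
monogamous = [s for s in all_sets if len(set(s)) == len(s)]

# Seniority protocol: X subscripts must increase along with the Y's.
ordered = [s for s in monogamous if list(s) == sorted(s)]

print(len(all_sets), len(monogamous), len(ordered))
```

Under these assumed lists the three counts agree with the text, and a set such as 4365 survives the monogamy filter but is removed by the seniority protocol.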

An  easy  way  of  arriving  at  these  two  sets  is  to  begin 
with  3446  composed  of  the  closest  X’s  to  Y’s.  To  avoid 
duplication  of  the  second  and  third  digits,  there  are  two 
possible  choices,  namely,  either  to  decrease  the  second 
digit  by  one  or  increase  the  third  one  by  one.  When 
the  second  is  decreased  by  one,  which  is  the  same  as  the 
first  digit,  then  we  decrease  the  first  by  one  thus 
obtaining  2346.  Similarly,  by  increasing  the  third  digit 
by  one  in  3446  we  obtain  the  second  possible  set  3456. 

In the second contiguous block of Figure 3.2, consisting of rows from 8 to 16 and columns from 5 to 7, we may arrive at 3 possible sets for verification of the LS criterion. This is one of the worst cases one may encounter for verification of the LS criterion. It can be established that the number of possible sets of pairs to be checked for the LS criterion is at most the minimum of the number of X's and that of Y's in a contiguous block.

Using  these  observations  we  formulate  a  general 
algorithm  in  the  next  section. 

FIGURE 3.2

4. Algorithm

The algorithm we develop here will separate X's and Y's into three sets A, B and C, where A is the set of matchable pairs (X_i, Y_j), while B consists of unmatched X's and C of unmatched Y's. Then the number of distinct targets is the sum of the sizes of these sets. We present the algorithm in a language-free step-by-step format:

1. Screen the data and categorize them into two arrays X(1: n) and Y(1: m).

2. Sort X(1: n) and Y(1: m) in a nondecreasing order.

3. Associate with each Y_j the values f_j, l_j and c_j, the subscripts of the first, last and closest compatible X respectively. Form a linked list of distances of compatible X's from Y_j.

4. Do recursively until each Y has at least 2 compatible X's:

i. Remove Y's with no compatible X's and place them in set C.

ii. If a Y has only one compatible X_i, then consider the set of all Y's which are compatible only with X_i. Match the closest Y with X_i. Remove and place this pair in set A. Remove all other unmatched Y's and place them in set C.


5. Partition the remaining X's and Y's into subgroups such that each Y in a subgroup has at least two compatible X's and each X has at least two compatible Y's within the subgroup. Moreover, f_j < l_{j-1} is true for all j with Y_j in the subgroup.

6. Without loss of generality, we assume a subgroup consists of X(1: p) and Y(1: q). Consider the array c(1: q) of subscripts of closest X's to Y's. If elements of this array are distinct, then we have found the set of matched pairs (X_{c_j}, Y_j).

Otherwise, from c(1: q) we form several arrays such that the resulting ones contain as many elements of c(1: q) as possible and other elements are closer to those of corresponding positions in c(1: q). The elements of these arrays are such that they are nondecreasing and conform to the monogamy and maximum number of matches criteria.

7. Verify for each array obtained in step 6 the LS criterion and select the one with the least value. Remove the X's and Y's corresponding to the set of the matched pairs and place them in set A. Remove unmatched X's and place them in set B, while unmatched Y's are removed and placed in set C.

8.  Repeat  steps  6  and  7  for  all  subgroups. 

In  implementing  this  algorithm  one  can  combine 
several  steps  and  do  them  in  a  single  step.  The 
algorithm  employs  essentially  the  backtracking  strategy 
with  criteria  as  bounding  functions. 

The space complexity is not of concern since n will be less than 50 and m less than 20. The time complexity includes times for screening, sorting and computation of the LS criterion. It is estimated to be of order O(mn).
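
The four criteria of section 2 can also be illustrated with a compact sketch. The code below is not the pruning and backtracking procedure of steps 1 to 8; it swaps in a dynamic program over the two sorted lists, which relies on the seniority protocol (fused pairs never cross) to maximize the number of fusions first and the LS error (2.3) second. The function name `fuse`, the default z_alpha = 1.96 and the data in the usage note are assumptions made for the example.

```python
import math


def fuse(xs, ys, sigma1, sigma2, z_alpha=1.96):
    """Return (pairs A, unmatched X's B, unmatched Y's C, fused positions)."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    tol = z_alpha * math.sqrt(sigma1 ** 2 + sigma2 ** 2)  # right side of (2.2)
    # best[i][j] = (number of fusions, -LS error) for xs[:i] and ys[:j];
    # tuples compare lexicographically, so more fusions always wins.
    best = [[(0, 0.0)] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            options = []
            if i > 0:
                options.append((best[i - 1][j], "skip_x"))
            if j > 0:
                options.append((best[i][j - 1], "skip_y"))
            if i > 0 and j > 0:
                gap = xs[i - 1] - ys[j - 1]
                if abs(gap) < tol:  # compatibility criterion (2.2)
                    count, neg_ls = best[i - 1][j - 1]
                    options.append(((count + 1, neg_ls - gap * gap), "fuse"))
            if options:
                best[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Walk back through the table to recover the three sets.
    pairs, only_x, only_y = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "fuse":
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif step == "skip_x":
            only_x.append(xs[i - 1])
            i -= 1
        else:
            only_y.append(ys[j - 1])
            j -= 1
    pairs.reverse()
    only_x.reverse()
    only_y.reverse()
    w_x, w_y = sigma2 ** 2, sigma1 ** 2  # weights of the estimator (2.1)
    fused = [(w_x * x + w_y * y) / (w_x + w_y) for x, y in pairs]
    return pairs, only_x, only_y, fused
```

With hypothetical readings `fuse([3.6, 5.0], [4.8, 6.2], 0.5, 0.5)`, the X at 5.0 is fused with the Y at 6.2 even though it lies closer to 4.8, because that choice allows two fusions instead of one. The table has (n+1)(m+1) cells, so the running time of this sketch is of order O(mn).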

5.  Extension 

There  are  two  aspects  to  be  considered  for  the 
algorithm  to  be  useful: 

1.  Extending  to  the  three-dimensional  observations; 

2. Fusing the observations from more than 2 sensors.

Our discussion has been centered on real-valued X's and Y's. With some modifications we can extend the algorithm to the case when the observations are real vectors.

As an extension to vector observations, an efficient estimator for the target position vector, when two observations made by two sensors are fused, involves a priori knowledge of the covariance matrix of observations for each sensor. One possible and simple suggestion is to use (2.1) for each component of the target position. There are several estimators which are based on different estimation criteria.

The  compatibility  criterion  given  in  (2.2)  can  be 
defined  as  (1  — a)  probability  ellipsoid  region  centered 
at  the  origin  with  the  axes  determined  by  the  sum  of 
the  covariance  matrices  of  the  two  sensors.  The  least 
square  error  defined  in  (2.3)  can  be  replaced  with  any 
other  metric. 

In  order  to  use  the  algorithm  the  metric  defined  in 
(2.3)  should  lead  us  to  define  an  ordering  on  X’s  and 
Y’s  for  sorting  and  for  maintaining  the  seniority 
protocol.  This  will  be  a  study  for  the  future. 

From fusion of data obtained from two sensors we go to fusion of data from three sensors. This may be done in two stages: fusing the data for the first two sensors and then fusing the result with the third sensor. However, this will introduce a large error probability. Further research is required on controlling error probabilities.


No algorithm is applicable unless it is implemented on a processor, and the real time taken by the computations should be studied. The question of practicality in a simulated battle scenario is to be explored, which will, in turn, force us to refine our algorithm.

Since  the  targets  are  moving  and  the  sensors  are 
continuously  polling,  the  fusion  of  the  data  can  be 
dynamically  verified  and  updated.  This  would  be  an 
immediate  line  of  extension  of  this  study. 

Finally, the results obtained in the study have applications in areas such as medicine where multiple sensors may be used for monitoring patients' conditions. How the brain of an animal processes the information from the data received through several senses will be another application. Fuzzy logic, which seems to play a larger role in electronics and its applications to photography, can be interfaced with the multiple sensors.

The author gratefully acknowledges the US Army Missile Command and Mr. Richard Jones for his patient and helpful briefings and suggestions.

Abstract

The performance of control charts is usually evaluated by assuming a step change in the process mean. However, it is more appropriate to evaluate the performance of control charts by assuming a drift in the mean for processes where a gradual drift models the shift in the mean more accurately. Three major methods for computing the average run length (ARL) of control charts assuming a step change in the mean are reviewed. Generalizations of these methods for computing the ARL of control charts assuming a drift in the mean are then examined.

KEY WORDS: Average run length; Cumulative sum; Exponentially weighted moving average; Integral equation; Linear drift; Markov chain; Normal distribution; Statistical process control.

1  Introduction 

Let x_1, x_2, ... be a sequence of independent and identically distributed measurements of quality from a manufacturing process and f_μ(x) be the probability density function of x_1, where μ is the mean of x_1. Without loss of generality, assume the in-control process mean to be zero and the standard deviation of x_1 to be one.

An upper-sided cumulative sum (CUSUM) chart is obtained by plotting

S_t = \max\{0, S_{t-1} + x_t - k\},

against the sample number t for t = 1, 2, ..., where k is a positive chart parameter and S_0 = u, 0 \le u < h. An out-of-control signal is issued at the first t for which S_t > h. A lower-sided CUSUM chart is obtained by plotting

T_t = \min\{0, T_{t-1} + x_t + k\},

against the sample number t for t = 1, 2, ..., where T_0 = u, -h < u \le 0. An out-of-control signal is issued at the first t for which T_t < -h. A two-sided CUSUM chart is obtained by running the lower-sided and upper-sided CUSUM charts simultaneously.

An exponentially weighted moving average (EWMA) chart is obtained by plotting

Q_t = (1 - \lambda) Q_{t-1} + \lambda x_t,

against t = 1, 2, ..., where λ is a smoothing constant such that 0 < λ \le 1 and Q_0 = u, -h < u < h. An out-of-control signal is issued by an EWMA chart when Q_t < -h or Q_t > h.
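
The chart recursions above translate directly into code. The following minimal sketch returns the run length (the sample number of the first out-of-control signal) for an upper-sided CUSUM chart and a two-sided EWMA chart; the function names and the short data series in the usage note are hypothetical.

```python
def cusum_upper_signal(xs, k, h, u=0.0):
    """First t with S_t > h, where S_t = max(0, S_{t-1} + x_t - k), S_0 = u."""
    s = u
    for t, x in enumerate(xs, start=1):
        s = max(0.0, s + x - k)
        if s > h:
            return t
    return None  # no signal within the observed samples


def ewma_signal(xs, lam, h, u=0.0):
    """First t with |Q_t| > h, where Q_t = (1 - lam) Q_{t-1} + lam x_t, Q_0 = u."""
    q = u
    for t, x in enumerate(xs, start=1):
        q = (1.0 - lam) * q + lam * x
        if q < -h or q > h:
            return t
    return None  # no signal within the observed samples
```

For instance, `cusum_upper_signal([1, 1, 1, 1], k=0.5, h=1.2)` gives S_t = 0.5, 1.0, 1.5 and therefore signals at t = 3.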

The run length of a control chart is defined to be the sample number when an out-of-control signal is first issued. The ARL of a control chart, which is the expectation of the run length, is often used as a measure of performance of a control chart. Three major methods for computing the ARL of control charts assuming a step change in the mean are reviewed in Section 2. Generalizations of these methods for computing the ARL of control charts assuming a drift in the mean are then examined in Section 3.

2  Step  Changes 

A common method for computing the ARL of a control chart assuming a step change in the mean is through the use of integral equations. The ARL function of a CUSUM chart was first derived by Page (1954) as an integral equation

L(u) = 1 + L(0) \Pr(X < k - u) + \int_0^h L(x) f_\mu(x + k - u) \, dx,

where L(u) denotes the ARL of a CUSUM chart given that S_0 = u.

Using an argument similar to Page (1954), the ARL of an EWMA chart was expressed by Crowder (1987) as an integral equation

L(u) = 1 + \frac{1}{\lambda} \int_{-h}^{h} L(x) f_\mu\left(\frac{x - (1 - \lambda) u}{\lambda}\right) dx,

where L(u) here denotes the ARL function of a two-sided EWMA chart given that Q_0 = u.

The CUSUM integral equation can be approximated numerically by replacing the equation with a system of linear algebraic equations using a Gaussian quadrature and solving the system of linear equations. A comprehensive discussion of methods used to obtain approximate solutions to integral equations can be found in Baker (1977).

Let w_1, w_2, ..., w_n and u_1, u_2, ..., u_n be the n-point weights and abscissas of a Gaussian quadrature such that

\int_0^h g(x) \, dx \approx \sum_{i=1}^{n} w_i g(u_i).

Using the Gaussian quadrature, the CUSUM integral equation can then be replaced by a system of linear equations in n + 1 unknowns L(u_1), ..., L(u_{n+1}),

L(u_j) \approx 1 + L(0) \Pr(X < k - u_j) + \sum_{i=1}^{n} w_i L(u_i) f_\mu(u_i + k - u_j),

where j = 1, 2, ..., n, n + 1 and u_{n+1} = 0 < u_1 < u_2 < ... < u_n < h. The ARL function L(u), 0 < u < h, can then be approximated as

L(u) \approx 1 + L(u_{n+1}) \Pr(X < k - u) + \sum_{i=1}^{n} w_i L(u_i) f_\mu(u_i + k - u).


The  EWMA  integral  equation  can  be  approximated  in  a 
similar  manner. 
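
The quadrature scheme described above can be sketched in Python using only the standard library. The sketch assumes normally distributed observations with unit variance, computes Gauss-Legendre abscissas by Newton iteration, and solves the n + 1 linear equations by Gaussian elimination; the function names and the default n = 24 are choices made for this illustration rather than part of the original method description.

```python
import math


def gauss_legendre(n):
    """Nodes and weights of the n-point Gauss-Legendre rule on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess
        for _ in range(50):
            p_prev, p = 1.0, x  # Legendre polynomials P_0 and P_1
            for m in range(2, n + 1):
                p_prev, p = p, ((2 * m - 1) * x * p - (m - 1) * p_prev) / m
            dp = n * (x * p - p_prev) / (x * x - 1.0)  # derivative of P_n
            step = p / dp
            x -= step
            if abs(step) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def cusum_arl(h, k, mu=0.0, n=24, u=0.0):
    """ARL of an upper-sided CUSUM chart started at S_0 = u, mean mu."""
    t, w = gauss_legendre(n)
    nodes = [0.5 * h * (ti + 1.0) for ti in t]  # abscissas mapped to (0, h)
    wts = [0.5 * h * wi for wi in w]
    pts = nodes + [0.0]                          # u_{n+1} = 0
    a = []
    for j in range(n + 1):
        row = [-wts[i] * pdf(nodes[i] + k - pts[j] - mu) for i in range(n)]
        row.append(-cdf(k - pts[j] - mu))        # coefficient of L(0)
        row[j] += 1.0
        a.append(row)
    sol = solve(a, [1.0] * (n + 1))
    return (1.0 + sol[n] * cdf(k - u - mu)
            + sum(wts[i] * sol[i] * pdf(nodes[i] + k - u - mu) for i in range(n)))
```

Calling `cusum_arl(5.0, 0.5)` evaluates the in-control ARL of a one-sided chart with h = 5 and k = 0.5, and `cusum_arl(5.0, 0.5, mu=1.0)` evaluates the ARL after a step change of one standard deviation.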

The second method for computing the ARL of a control chart is to use standard results from Markov chain theory. This method was first proposed by Brook and Evans (1972). Consider an EWMA chart with chart limits (-h, h). This interval is partitioned into n subintervals and let R be the matrix containing the one-step transition probabilities for the transient states. The ARL vector u = (u_1, u_2, ..., u_n)^T is then given by

u = (I - R)^{-1} 1


where I is the n x n identity matrix and 1 is an n x 1 vector of 1's. The Markov chain method may also be used in a similar manner to compute the ARL of a CUSUM chart.
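
A minimal sketch of the Markov chain calculation for an EWMA chart follows. It assumes normal observations with unit variance, represents each subinterval by its midpoint, and solves (I - R) u = 1 by Gaussian elimination; the discretization details, function names and default n = 101 are assumptions made for the example.

```python
import math


def cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ewma_arl_markov(lam, h, mu=0.0, n=101, u=0.0):
    """ARL of a two-sided EWMA chart with limits (-h, h) and Q_0 = u."""
    width = 2.0 * h / n
    mids = [-h + (i + 0.5) * width for i in range(n)]
    a = []
    for mid in mids:
        row = []
        for j in range(n):
            lo = -h + j * width
            hi = lo + width
            # P(lo < (1 - lam) * mid + lam * X <= hi) with X ~ N(mu, 1)
            p = (cdf((hi - (1.0 - lam) * mid) / lam - mu)
                 - cdf((lo - (1.0 - lam) * mid) / lam - mu))
            row.append(-p)  # entry of -R
        a.append(row)
    for i in range(n):
        a[i][i] += 1.0      # a is now I - R
    arl = solve(a, [1.0] * n)
    start = min(range(n), key=lambda i: abs(mids[i] - u))
    return arl[start]
```

A convenient check is the case lam = 1, where the EWMA chart reduces to a Shewhart chart and the ARL with h = 3 is 1 / (1 - Pr(-3 < X < 3)), about 370.4.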

The third method is the Monte Carlo method, which is easily programmed but highly inefficient. These three methods allow the ARL of a control chart to be evaluated for any particular value of μ and hence allow the performance of control charts to be evaluated assuming a step change in the mean.

3  Linear  Drift 

The Monte Carlo method can be generalized easily to handle the case when the process mean drifts and will not be discussed any further in this paper.
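
For completeness, a minimal Monte Carlo sketch under a linear drift is given below. It assumes unit-variance normal observations, takes the first sample with the mean in control, and increases the mean by a fixed amount `drift` per sampling interval; the function names, the seed and the cap `max_steps` are hypothetical choices for the illustration.

```python
import random


def cusum_run_length(rng, k, h, drift=0.0, max_steps=100000):
    """Simulate one run length of an upper-sided CUSUM chart with S_0 = 0."""
    s, t = 0.0, 0
    while t < max_steps:
        mean = drift * t  # the first sample (t = 0 here) is in control
        t += 1
        s = max(0.0, s + rng.gauss(mean, 1.0) - k)
        if s > h:
            return t
    return max_steps


def monte_carlo_arl(k, h, drift, reps, seed=0):
    """Average of `reps` simulated run lengths."""
    rng = random.Random(seed)
    return sum(cusum_run_length(rng, k, h, drift) for _ in range(reps)) / reps
```

Comparing `monte_carlo_arl(0.5, 5.0, 0.0, 200)` with `monte_carlo_arl(0.5, 5.0, 0.05, 200)` shows how quickly even a slow drift shortens the run length, at the price of sampling error that shrinks only with the square root of the number of replications.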

The performance of CUSUM charts under linear drifts in the process mean was investigated by Bissell (1984). It is assumed that the first sample is taken when the mean is in control and that the mean drifts gradually at a rate of Δσ_x̄ per sampling interval, where σ_x̄ is the standard deviation of the sample mean and Δ is a positive constant. Based on a modification of the Markov chain method developed by Brook and Evans (1972), Bissell computed the ARL of a CUSUM chart under a linear drift. A nonhomogeneous Markov chain is obtained for the linear drift case. However, Bissell noted in a corrigendum that the ARL computed using his Markov chain method is not accurate, possibly due to rounding errors. Based on simulation results, Bissell showed that the ARL computed using his Markov chain method is at least two times larger than the actual ARL for a CUSUM chart with h = 5.0, k = 0.5 and a drift coefficient Δ = 0.005. The accuracy of the Markov chain method has been greatly improved by a refinement due to Asbagh (1985).

Gan (1991a, 1991b) generalized the integral equation method to handle the case when the mean drifts. Let the mean be μ_0, μ_1, ..., μ_m, μ_m, ... when successive random samples of products are taken from a production process. Note that μ_m can be set to an arbitrarily large number to approximate a linear wear process. Let the sample mean be independently distributed with probability density function f_{μ_j}(x). Note that μ_m is the stabilized process mean. Suppose that L_j(u, μ_j), j = 0, 1, 2, ..., m is the ARL of a two-sided EWMA chart given that Q_0 = u and random samples of products are taken when the mean is μ_j, μ_{j+1}, ... . Gan (1991a) showed that

L_j(u, \mu_j) = 1 + \frac{1}{\lambda} \int_{-h}^{h} L_{j+1}(x, \mu_{j+1}) f_{\mu_j}\left(\frac{x - (1 - \lambda) u}{\lambda}\right) dx

for j = 0, 1, 2, ..., m - 1 and

L_m(u, \mu_m) = 1 + \frac{1}{\lambda} \int_{-h}^{h} L_m(x, \mu_m) f_{\mu_m}\left(\frac{x - (1 - \lambda) u}{\lambda}\right) dx.

The last equation is an integral equation and the ARL function L_m(u, μ_m) can be approximated numerically by replacing the equation with a system of linear algebraic equations using a Gaussian quadrature and solving the system of equations. Once the ARL function L_m(u, μ_m) is found, a simple substitution method may be employed to compute L_j(u, μ_j), j = m-1, m-2, ..., 2, 1, 0 recursively. Note that L_0(0, μ_0) is the ARL of an EWMA chart with Q_0 = 0, assuming that the first sample is taken when the mean is at μ_0 and subsequent samples are taken when the mean is at μ_j for j = 1, 2, ..., m. Similar ARL equations for a CUSUM chart assuming a drift in the mean are obtained by Gan (1991b).

4  Conclusions 

Three major methods for computing the ARL of CUSUM and EWMA control charts under step shifts and linear drifts are reviewed in this paper. Both the Markov chain and integral equation methods yield accurate ARL values of control charts under linear drifts. The methods discussed in this paper may also be used to study the run length properties of control charts with drift that is not linear in nature.

1. Introduction

Analytical problems in a number of scientific disciplines concern comparisons of two sets of distances among labelled points. These two sets of distances (or metrics) correspond to different Euclidean representations of the points. The comparison between these two sets of distances, or between the two configurations of points in space, is often expressed best in terms of a deformation that maps the set of labelled points in one representation into the corresponding set in the second representation. In this paper we discuss the computation and interpretation of these deformations for two particular fields of application and the visualization of these deformations using the graphical technique of biorthogonal grids (Bookstein 1978).

Perhaps the primary example of comparisons of sets of points through deformations is in the field of cartography. Distortions induced by representing the surface of the earth with (planar) maps are studied in terms of the properties of various map projections (Richardus and Adler 1972). In fact the basis of the method of biorthogonal grids presented here was established a century ago by Tissot (1881) for just such problems. Tissot's theorem shows that the image of any small (infinitesimal) circle under a continuous transformation is an ellipse, known in cartography as Tissot's indicatrix. The axes of the ellipse represent the local principal (maximum and minimum) strains of the transformation and the ratio of the area of the ellipse to the area of the circle represents the proportionate change (distortion) in surface area. The theorem further states that at any point in the domain image (e.g., the surface of the earth) there is a unique pair of (infinitesimal) lines or directions at 90° that intersect also at 90° in the response image (e.g., planar map), unless the transformation is conformal. These lines are the axes of Tissot's indicatrix. The distinctive feature of these cartographic applications is that projections relating the surface of the earth to the map are known analytically. Tobler (1978) suggests other problems in the study of geographic patterns where the mappings must be computed from data.
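
As an illustration of Tissot's theorem, the principal strains and directions at a point can be computed from the 2 x 2 Jacobian J of a differentiable planar transformation: the directions are the eigenvectors of the symmetric matrix J^T J, and the strains are the lengths of their images. The sketch below assumes the Jacobian entries are already available; the function name and the shear example in the usage note are hypothetical.

```python
import math


def principal_strains(a, b, c, d):
    """Principal strains and directions for the Jacobian J = [[a, b], [c, d]].

    Returns (strain1, strain2, v1, v2, area_ratio), where v1 and v2 are
    perpendicular unit vectors whose images under J are also perpendicular.
    """
    # Entries of the symmetric matrix J^T J.
    c11 = a * a + c * c
    c12 = a * b + c * d
    c22 = b * b + d * d
    # Angle of the major eigenvector of J^T J.
    theta = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))

    def stretch(v):
        # Length of the image J v of a unit vector v.
        return math.hypot(a * v[0] + b * v[1], c * v[0] + d * v[1])

    return stretch(v1), stretch(v2), v1, v2, abs(a * d - b * c)
```

For the simple shear with a = 1, b = 1, c = 0, d = 1, the two strains multiply to the area ratio |det J| = 1, so the small circle is mapped to an ellipse of equal area. For a conformal map J^T J is a multiple of the identity and the returned directions are arbitrary, which matches the exception noted in the theorem.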

Problems in biology, or more specifically morphometrics (the measurement of biologic shapes, their variation and change), motivated the development of the algorithms presented here. In 1917, D'Arcy Thompson introduced the idea of using mathematical deformations for describing or reifying the theoretical construct of biological homology. Two biological forms were to be compared in terms of a deformation of one form into the other. Images to be compared might be of two distinct biologic species related in evolutionary terms, or images of the same biologic specimen observed at two different ages in a longitudinal study of growth. For visualization Thompson used the method of transformation grids, in which a square or regular grid superimposed on the image of one biologic form is transformed into an irregular grid over a second form.

Many investigators attempted to place Thompson's seminal idea on a precise mathematical foundation suitable for computer analysis and measurement of shape change. This was not achieved until Bookstein's (1978) introduction of the method of biorthogonal grids. (Closely related methods were suggested independently by Tobler (1978) at about the same time.) Bookstein's approach focusses on labelled points called landmarks of anatomical or evolutionary significance in the biologic images to be compared. The purpose of Bookstein's analysis was the depiction and measurement of shape change interpolated smoothly from the sets of corresponding labelled landmarks in the two images. The method involves (a) interpolation of a correspondence between n pairs of points in R^2 into a differentiable mapping defined everywhere in the plane, and (b) the drawing (and labelling) of integral curves of the infinitesimal perpendicular lines guaranteed to exist (for differentiable transformations) according to Tissot's Theorem. A formal definition for these curves is as follows.

Definition. Through almost every point of a differentiable
transformation pass just two differentials which are
at 90° both before and after transformation. The
integral curves of these differentials form a grid whose
intersections are at 90° in both images. These are called
the biorthogonal grids of the transformation.

Another field to which the method of biorthogonal
grids has been applied is the statistical analysis of
spatial data obtained in routine monitoring of environmental
processes. Examples include spatial analyses of
mesoscale variation in solar radiation (Sampson and
Guttorp 1991), wind speed (Guttorp and Sampson
1989), rainfall, and acid deposition (Guttorp et al. 1991).
Sampson and Guttorp suggested that the spatial covariance
structure of environmental monitoring data could
be represented and estimated in terms of a function
mapping the geographic locations of a set of monitoring
stations (with coordinates generally being planar coordinates
from a map projection whose effects are being
ignored) into a second synthetic set of planar coordinates
computed to encode the spatial covariance structure;
distances between the stations encode observed
spatial covariances so that greater covariances are
represented by smaller distances. In this application,
the biorthogonal grids reflect spatially varying anisotropic
spatial covariance structure, what may be called a
"moving principal components analysis of the spatial
covariance structure."

While the algorithms and graphics are the same
for each of the applications cited above, there are
important, albeit subtle, differences of interpretation.
For the analysis of map projections the analytically
specified mappings apply to all locations in the domain
image, and thus Tissot's indicatrix can be computed and
interpreted everywhere. In the environmental monitoring
problem we compute a smooth mapping from data
at a finite set of points considered as a spatial sample of
an underlying process. Thus the biorthogonal grids
computed by interpolation may be interpreted as estimates
of a phenomenon (spatial covariance pictured as
deformation) which is defined (and theoretically observable)
everywhere. That is, pairs of images can conceivably
be generated for monitoring stations located anywhere
in the geographic region of interest.

However, in the morphometric applications
correspondences between images cannot generally be
established except at a finite (or one-dimensional) set of
points. The biorthogonal grids provide an illustration
referring only to the landmarks available for analysis.
One cannot argue that they represent (estimates of) real
deformation as correspondences (homologies) cannot be
defined everywhere. (See Bookstein 1991.)

In Section 2 we explain the interpretation of
biorthogonal grids for a pair of simple examples based
on a hypothetical square configuration of four landmarks.
We utilize thin-plate splines to represent the
deformations we compute from corresponding landmarks
in pairs of images. Section 3 explains the
rationale for the choice of thin-plate splines and reviews
their algebra. Section 4 details the algorithms for drawing
biorthogonal grids for a specified mapping. The last
section presents a pair of (very different) real applications.

2.  Interpretation  of  biorthogonal  grids 

We begin with simple linear or affine transformations,
$u = f(x) = Ax$, where $A$ is a 2×2 matrix and $x$
and $u$ represent coordinate vectors in the two images to be
compared, respectively. Linear mappings are characterized
by a single pair of principal axes given by the
eigenvectors of $A^{T}A$ (or left singular vectors of $A$).
The direction or axis corresponding to the largest eigenvalue
is the direction in which the plane is (relatively)
most stretched. The ratios of distance in the second
image to distance in the first for pairs of points $x_i$ and $x_j$
aligned with the principal axes (e.g., $|u_i - u_j| / |x_i - x_j|$)
are called the principal strains. These are the square
roots of the eigenvalues of $A^{T}A$ (or the singular values
of $A$).
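The computation just described can be sketched in a few lines of NumPy. This is only an illustration: the matrix `A` below is a hypothetical example (it is not the transformation of Figure 1), and the function name `principal_axes_and_strains` is ours. The axes are obtained directly from the eigen-decomposition of $A^{T}A$, as in the text.

```python
import numpy as np

def principal_axes_and_strains(A):
    """Principal strains (largest first) and matching unit axes in the first image."""
    A = np.asarray(A, dtype=float)
    # Symmetric eigenproblem for A^T A; eigh returns eigenvalues in ascending order.
    evals, evecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(evals)[::-1]                 # largest eigenvalue first
    strains = np.sqrt(np.clip(evals[order], 0.0, None))
    axes = evecs[:, order]                          # columns are unit principal axes
    return strains, axes

# Hypothetical 2x2 matrix, chosen only for illustration.
A = np.array([[0.8, 0.2],
              [0.1, 0.55]])
strains, axes = principal_axes_and_strains(A)

# Images of the principal axes in the second image; they remain perpendicular.
images = A @ axes
```

The columns of `axes` are perpendicular in the first image and their images `A @ axes` are perpendicular in the second, with lengths equal to the principal strains. For a linear map these two directions are the same everywhere, which is why the grid of Figure 1 consists of straight lines.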

Figure 1 depicts the effect of a linear transformation
on a starting configuration of four points arranged
in a square. The resulting figure is a parallelogram. The
families of perpendicular lines indicate the directions of
the principal axes. They correspond between the two
images and are the biorthogonal grids for the linear
transformation. The figure refers to the principal strains
as gradients ("grad"). Those for the transformation
from parallelogram to square are the inverses of those
for the transformation from square to parallelogram. In
this case the two principal strains are 0.856 (coded by
dotted lines in the left panel of Figure 1) and 0.506
(coded by dashed lines).

A simple nonlinear mapping of a square into an
arbitrary quadrilateral is depicted in Figure 2 (approximately
replicating Figure VI-6 in Bookstein 1978). We
will write such a nonlinear mapping $f: R^2 \to R^2$ as

$$(u, v) = f(x, y) = (u(x,y), v(x,y)).$$

In the neighborhood of any point $(x,y)$ we can compute
a local linear approximation

$$f(x + \delta x, y + \delta y) = f(x,y) + A_{xy} \begin{pmatrix} \delta x \\ \delta y \end{pmatrix} + \cdots,$$

where the matrix $A_{xy}$ is the (affine) derivative matrix
evaluated at the point $(x,y)$,

$$A_{xy} = \begin{pmatrix} \partial u/\partial x & \partial u/\partial y \\ \partial v/\partial x & \partial v/\partial y \end{pmatrix}.$$

The left singular vectors of $A_{xy}$ indicate the local
principal axes of the nonlinear transformation. These
are the differentials referred to in the definition of
biorthogonal grids given above. In Figure 2 we can
identify these directions along the curves drawn. For
example, in the neighborhood of point 4, the first local
principal axis (with a principal strain between 0.5 and
0.75) points in a direction at approximately 30° below
the horizontal. In the vicinity of point 2, the first principal
axis (with a strain greater than 0.75) has rotated to
an angle of approximately 45° from the horizontal. Note
that the biorthogonal grids (a sampling of integral
curves of these local principal directions) depict the
spatial variation in the principal strains (derivative) of a
nonlinear mapping.


3. Algebra of thin-plate splines for plane mappings

Our problem is to determine a smooth mapping
$f: R^2 \to R^2$ that interpolates a correspondence between
labelled points $(x_i, y_i)$ and $(u_i, v_i)$, respectively,
$i = 1, 2, \ldots, n$, in two images. That is, $f$ must satisfy

$$(u_i, v_i) = f(x_i, y_i) = (u(x_i, y_i), v(x_i, y_i)). \qquad (1)$$


As noted above, biorthogonal grids depict the spatial
variation in the derivative of a mapping $f$. Therefore,
as a measure of smoothness it is natural to consider the
family of thin-plate splines which minimize variation in
the derivatives as expressed in the following quantity:

$$I_f = \iint_{R^2} \sum_{w \in \{u, v\}} \left[ \left(\frac{\partial^2 w}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 w}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 w}{\partial y^2}\right)^2 \right] dx\,dy.$$

The terms involving $u$ and $v$ are, separately, equations
for the bending energy of an idealized thin metal plate
displaced to pass through "vertical" coordinates $u_i$ or $v_i$,
respectively, at the points $(x_i, y_i)$. Note that linear functions
of $x$ and $y$ have zero bending energy; i.e., they are
perfectly smooth.

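We review here the algebra of thin-plate splines.
Similar presentations appear in Bookstein (1989, 1991).
Consider first the problem for just one of the response
coordinates, $v(x,y)$. Let $P_i = (x_i, y_i)^T$ and denote the
distance between points $i$ and $j$ by $r_{ij} = |P_i - P_j|$.
Define the function $U(r) = r^2 \log r^2$. Then the solution
$v(x,y)$ minimizing (the $v$ component of) $I_f$ subject
to the interpolation constraints (1) is

$$v(x,y) = \sum_{i=1}^{n} w_{vi}\, U(|P_i - (x,y)^T|) + a_{v0} + a_{vx} x + a_{vy} y,$$

where the coefficient vector defined as
$\theta = (w_{v1}, \ldots, w_{vn}, a_{v0}, a_{vx}, a_{vy})^T$ satisfies a linear system

$$Y = L\theta,$$

with $Y = (v_1, v_2, \ldots, v_n, 0, 0, 0)^T$, and the $(n+3) \times (n+3)$
matrix $L$ is

$$L = \begin{pmatrix} K & P \\ P^T & 0 \end{pmatrix},$$

where

$$K = \begin{pmatrix} 0 & U(r_{12}) & \cdots & U(r_{1n}) \\ U(r_{21}) & 0 & \cdots & U(r_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ U(r_{n1}) & U(r_{n2}) & \cdots & 0 \end{pmatrix}$$

and

$$P = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{pmatrix},$$

and $0$ denotes the 3×3 zero matrix.

For further discussion of the role and interpretation
of the function $U$, see Bookstein (1991, Ch. 8).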
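The linear system above is small enough to solve directly. The following NumPy sketch fits one response coordinate of a thin-plate spline by assembling $L$ and solving $Y = L\theta$. The function names (`tps_kernel`, `tps_fit`, `tps_eval`) and the landmark values are hypothetical, chosen for illustration; this is not the authors' Splus code.

```python
import numpy as np

def tps_kernel(r):
    """U(r) = r^2 log r^2, with U(0) taken as 0."""
    r2 = np.asarray(r, dtype=float) ** 2
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out

def tps_fit(points, values):
    """Solve Y = L theta for one response coordinate; returns (w, a)."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    r = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    K = tps_kernel(r)                                   # n x n kernel block
    P = np.hstack([np.ones((n, 1)), points])            # n x 3 affine block
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    Y = np.concatenate([np.asarray(values, dtype=float), np.zeros(3)])
    theta = np.linalg.solve(L, Y)
    return theta[:n], theta[n:]                         # w (n,), a = (a0, ax, ay)

def tps_eval(points, w, a, xy):
    """Evaluate the fitted spline at one or more locations xy."""
    points = np.asarray(points, dtype=float)
    xy = np.atleast_2d(np.asarray(xy, dtype=float))
    r = np.linalg.norm(xy[:, None, :] - points[None, :, :], axis=2)
    return tps_kernel(r) @ w + a[0] + xy @ a[1:]

# Hypothetical square of four landmarks and target "vertical" coordinates.
landmarks = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
v_targets = np.array([0.0, 0.1, 1.3, 0.9])
w, a = tps_fit(landmarks, v_targets)
```

Calling `tps_fit` once with the $u_i$ and once with the $v_i$ gives the two coordinate functions of the plane mapping. When the targets are an exact linear function of $(x, y)$ the weights `w` vanish and only the affine part remains.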

The value of the (minimal) bending energy is proportional
to a quadratic form in the coefficients
$W = (w_{v1}, \ldots, w_{vn})^T$, or in the observations
$V = (v_1, v_2, \ldots, v_n)^T$:

$$I_f \propto W^T K W = V^T (L_n^{-1} K L_n^{-1}) V = V^T L_n^{-1} V,$$

where $L_n^{-1}$ refers to the upper left $n \times n$ submatrix of
$L^{-1}$. Note that this quadratic form is zero if and only if
$w_{v1} = w_{v2} = \cdots = w_{vn} = 0$, which means that the fitted spline is
linear, $v(x,y) = a_{v0} + a_{vx} x + a_{vy} y$.

For the two-dimensional response problem of
interest here we define $V$ as an $n \times 2$ matrix,

$$V = \begin{pmatrix} u_1 & v_1 \\ u_2 & v_2 \\ \vdots & \vdots \\ u_n & v_n \end{pmatrix},$$

so that

$$I_f \propto \mathrm{tr}(V^T L_n^{-1} V).$$

From this expression it is easy to see that the quantity
minimized is invariant under arbitrary translation and
rotation of the coordinates $(u,v)$. In fact, the whole
procedure is invariant under translation and rotation of
either set of coordinates, $(x,y)$ and/or $(u,v)$.
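The invariance claim can be checked numerically. The sketch below forms the upper left $n \times n$ block of $L^{-1}$ and evaluates the trace form for a hypothetical landmark configuration and for a rotated and translated copy of its targets. The names `bending_matrix` and `bending_energy` and all coordinates are assumptions made for illustration.

```python
import numpy as np

def bending_matrix(points):
    """Upper left n x n block of L^{-1} for landmarks `points` (n x 2)."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    r2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    # U(r) = r^2 log r^2; the inner where() makes the diagonal exactly 0.
    K = r2 * np.log(np.where(r2 > 0, r2, 1.0))
    P = np.hstack([np.ones((n, 1)), points])
    L = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    return np.linalg.inv(L)[:n, :n]

def bending_energy(points, V):
    """tr(V^T L_n^{-1} V), proportional to the bending energy of the mapping."""
    V = np.asarray(V, dtype=float)
    return float(np.trace(V.T @ bending_matrix(points) @ V))

# Hypothetical landmarks (x, y) and target coordinates (u, v).
xy = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.4]])
uv = np.array([[0.0, 0.0], [1.1, 0.0], [1.2, 1.3], [-0.1, 0.9], [0.6, 0.5]])

theta = 0.7                                    # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
uv_moved = uv @ R.T + np.array([3.0, -2.0])    # rotate, then translate

e_original = bending_energy(xy, uv)
e_moved = bending_energy(xy, uv_moved)         # agrees with e_original up to rounding
```

Translation drops out because $L_n^{-1}$ annihilates constant vectors, and rotation drops out because the trace is unchanged when $V$ is multiplied on the right by an orthogonal matrix.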


4.  Computing  and  drawing  biorthogonal  grids 


This section explains how we compute and draw
biorthogonal grids as a visualization of a differentiable
mapping $f: R^2 \to R^2$ over a specified region of the
plane. For a given mapping $f$, e.g., a thin-plate spline,
we can compute local linear approximations in terms of
the affine derivative matrix $A_{xy}$ defined above. From
this matrix we can compute the differentials at $(x,y)$
corresponding to the sets of curves of the biorthogonal
grids as the left singular vectors of $A_{xy}$. Write the
direction of greatest principal strain as

$$a(x,y) = \begin{pmatrix} a_1(x,y) \\ a_2(x,y) \end{pmatrix}.$$

In order to draw out the integral curve of the differentials
emanating from the point $(x,y)$ we solve a
system of differential equations

$$\frac{dx}{dt} = a_1(x,y), \qquad \frac{dy}{dt} = a_2(x,y),$$

where $t$ denotes arclength along the curve in the direction
of the local principal axis. Points along the curve
are computed by running a (Runge-Kutta) differential
equation solver in "one-step" mode. That is, we evaluate


where $t_{max}$ is the maximum step size to be taken. In
one-step mode this returns


The size of the steps $\delta t$, i.e., the spacing of points generated,
depends on the curvature of the grids or the spatial
rate of change of the affine derivative matrix. Associated
with each step is the principal strain, the singular
value of the affine derivative evaluated at the starting
point. These values are used to code or label the line
segments connecting the sequence of points generated
along the curves.
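A minimal sketch of the curve-tracing step follows. It makes several assumptions that are not in the text: the derivative matrix is estimated by central differences rather than analytically, a fixed-step classical fourth-order Runge-Kutta scheme stands in for the adaptive one-step solver described above, and the example mapping `f_demo` is invented. The principal direction is taken from the eigenvectors of $A_{xy}^{T}A_{xy}$, and its sign is aligned with the previous direction because an eigenvector is defined only up to sign.

```python
import numpy as np

def jacobian(f, p, h=1e-6):
    """Central-difference estimate of the affine derivative matrix A_xy at p."""
    p = np.asarray(p, dtype=float)
    J = np.empty((2, 2))
    for k in range(2):
        e = np.zeros(2)
        e[k] = h
        J[:, k] = (np.asarray(f(p + e)) - np.asarray(f(p - e))) / (2 * h)
    return J

def principal_direction(f, p, prev=None, major=True):
    """Unit principal axis at p and its strain; sign chosen to agree with `prev`."""
    A = jacobian(f, p)
    evals, evecs = np.linalg.eigh(A.T @ A)      # ascending eigenvalues
    k = 1 if major else 0
    d = evecs[:, k]
    if prev is not None and d @ prev < 0:
        d = -d                                  # keep a consistent orientation
    return d, float(np.sqrt(max(evals[k], 0.0)))

def trace_curve(f, start, step=0.05, n_steps=40, major=True):
    """Fixed-step RK4 integration of dx/dt = a1, dy/dt = a2 in arclength t."""
    pts = [np.asarray(start, dtype=float)]
    strains = []
    d_prev, _ = principal_direction(f, pts[0], None, major)
    for _ in range(n_steps):
        p = pts[-1]
        k1, s = principal_direction(f, p, d_prev, major)
        k2, _ = principal_direction(f, p + 0.5 * step * k1, k1, major)
        k3, _ = principal_direction(f, p + 0.5 * step * k2, k2, major)
        k4, _ = principal_direction(f, p + step * k3, k3, major)
        pts.append(p + step * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0)
        strains.append(s)                       # strain at the start of this step
        d_prev = k1
    return np.array(pts), np.array(strains)

def f_demo(p):
    """Hypothetical smooth nonlinear plane mapping, for illustration only."""
    x, y = p
    return np.array([x + 0.3 * x * y, y + 0.2 * x ** 2])

curve, curve_strains = trace_curve(f_demo, [0.5, 0.5], step=0.05, n_steps=20)
```

Passing `major=False` traces the curve of least local strain from the same starting point. Near a point where the two strains coincide the principal directions are undefined, so a practical implementation would need to detect that case; the sketch does not.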

The user must consider a number of graphical
design decisions in drawing out a sample of curves for a
biorthogonal grid. Our implementation in the S system
(Splus, Statistical Sciences 1990) provides the user with
a variety of options. A default, semi-automatic procedure
starts by generating points along the curves in
both the major (greatest local strain) and minor (least
local strain) principal directions emanating from a
user-specified starting point on the first image. Then, at
points approximately equally spaced along the curve of
greatest local strain just generated, the program initiates
a series of points along the curves in the direction of
least local strain. It similarly initiates series of points
defining curves in the direction of greatest local strain
from approximately equally spaced starting points along
the first curve in the minor direction.

We connect the points in curves with the value of
the principal strain encoded in the line type, line width,
and/or line color. Color seems the most effective visual
cue for recognizing the relative magnitude of the principal
strains as they vary in space (although we do not
demonstrate color here).

Using Splus' interactive graphics capability the
user may initiate biorthogonal grid curves from any
location in the first image. The user may also point at
a plotted curve in order to print the value of the strain at
that point. A system of Splus functions for the calculations
and graphics discussed here is available from the
authors.

5.  Applications 

Our first example is drawn from a classical morphometric
study of a congenital craniofacial deformity
known as Apert Syndrome (see Bookstein 1991). Landmark
coordinates were digitized from lateral cephalograms
of 14 cases of the syndrome. Pictured in Figure 3
are the mean coordinates of eight landmarks for these
14 cases and a similar mean configuration for a sample
of age and sex-matched controls ("normals"). Our
interest is in describing the difference in these two
configurations by viewing the mean Apert configuration
as a deformation of the mean configuration of the controls.

In applications it is often useful to visualize plane
mappings using both the image of the mapping of a regular
grid of points (after D'Arcy Thompson) and
biorthogonal grids. At the bottom of Figure 3 we depict
the thin-plate spline mapping the normal mean landmark
configuration into the Apert mean landmark
configuration by the mapping of a regular grid and in
the upper right corner we show the corresponding
biorthogonal grid. We utilize varying line width to
encode the principal strains of the mapping. The strains
vary from 0.33 (in the vicinity of the points PtM and
PNS) to 1.20 (between points Sel and SER). Bookstein
(1989, 1991) shows how the shape change represented
in Figure 3 can be usefully decomposed into features or
components of varying geometric scales.

Our second example concerns an application of
plane mappings in spatial statistics. A number of monitoring
networks have been measuring acid deposition in
rainfall over the past two decades. Problems concerning
the estimation of acid deposition at unmonitored
locations and the design or evaluation of monitoring
networks all require information about the spatial
covariance structure of the environmental process being
monitored. In this application we consider (log) hydrogen
ion deposition accumulated for four-weekly intervals
from data measured between 1981 and 1986 at 17
monitoring sites from the UAPSP monitoring network.
We denote by $Z(t,x)$ the observations at location $x$ and
time (month) $t$. We are interested in modeling the
"spatial dispersion" $\mathrm{Var}(Z(t,x_a) - Z(t,x_b))$ as a function
of arbitrary pairs of geographic locations $x_a$ and $x_b$.
The sample data provide estimates of these variances
for pairs of monitoring sites,

$$d_{ij} = \widehat{\mathrm{Var}}(Z(t,x_i) - Z(t,x_j)).$$
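For concreteness, the matrix of sample dispersions can be computed as follows. The sketch assumes a complete data array with one row per time period and one column per site, and it uses the unbiased sample variance; the text does not say how missing observations or the variance divisor were handled, so both choices are assumptions, as is the function name `sample_dispersion`.

```python
import numpy as np

def sample_dispersion(Z):
    """Z: (T, n) array, rows are times and columns are sites.

    Returns the n x n matrix with entries
    d_ij = sample variance over t of Z[t, i] - Z[t, j].
    """
    Z = np.asarray(Z, dtype=float)
    diff = Z[:, :, None] - Z[:, None, :]      # shape (T, n, n): all pairwise differences
    return diff.var(axis=0, ddof=1)           # variance across time for each pair

# Simulated stand-in data: 60 periods at 4 sites (not the UAPSP measurements).
rng = np.random.default_rng(1)
Z_sim = rng.normal(size=(60, 4))
d = sample_dispersion(Z_sim)
```

Each entry equals the sum of the two site variances minus twice their covariance, so a large dispersion corresponds to a low covariance between the two sites.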

Sampson and Guttorp (1991) introduced a family
of models of the form

$$\mathrm{Var}(Z(t,x_a) - Z(t,x_b)) = g(|f(x_a) - f(x_b)|),$$

where $f$ is a nonlinear mapping of the geographic coordinates
of the sampling sites and $g$ is a monotone
"variogram" function relating the $d_{ij}$ to the distances
among the transformed points $|f(x_i) - f(x_j)|$.

Figure 4 shows the location of 17 monitoring stations
and depicts the thin-plate spline mapping $f$ that
represents the nature of the spatial covariance structure.
Coordinates of sites in the lower right image were computed
in two steps as described in Sampson and Guttorp
(1991). First we applied multidimensional scaling to the
matrix of spatial dispersions $d_{ij}$ to generate a
configuration in which pairs of sites $x_i$ and $x_j$ with relatively
high spatial dispersion (low covariance) would be
located relatively far apart. Second we computed a
thin-plate smoothing spline to approximate these new
coordinates as a smooth deformation of the geographic
configuration.
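The first of the two steps can be illustrated with classical (Torgerson) metric scaling. This is a swapped-in technique for illustration: the text does not specify which multidimensional scaling variant was applied to the $d_{ij}$, and the function names and test configuration below are hypothetical. The sketch takes a matrix of dissimilarities treated as distances and returns planar coordinates.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical metric scaling of an n x n dissimilarity matrix D into `dim` dimensions."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:dim]          # leading eigenpairs
    lam = np.clip(evals[idx], 0.0, None)         # guard against small negative values
    return evecs[:, idx] * np.sqrt(lam)

def pairwise_distances(X):
    """Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Hypothetical planar configuration used only to exercise the function.
rng = np.random.default_rng(2)
sites = rng.normal(size=(6, 2))
coords = classical_mds(pairwise_distances(sites))
```

The recovered coordinates are determined only up to rotation, reflection and translation, which is harmless here because the thin-plate spline procedure of Section 3 is invariant under translation and rotation of either coordinate set.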


The biorthogonal grid specifies (our estimates of)
the geographic directions in which spatial dispersion is
greatest and weakest. Different line types encode the
range of values of the principal strains. Variation in
these principal strains reflects nonstationarity in the spatial
covariance structure, and can be understood from
the perspective of the atmospheric processes underlying
the monitored data. The large scale feature of the mapping
in Figure 4 is a relative compression along an axis
running WSW-ENE corresponding to a strong spatial
covariance in that direction and relatively weak covariance
in the direction at 90°.

Figure 1. Biorthogonal grids for a linear mapping.
Figure 2. Biorthogonal grids for a nonlinear mapping.
[Figure 3 panel title: Apert Mean Landmark Configuration]
Figure 4. Thin-plate spline mapping of UAPSP monitoring stations.
[Figure 4 panel title: Coordinates Representing Spatial Dispersion]

Abstract

This report describes initial experience with several image
sharpening and registration tools and their applications in
a longitudinal study of pediatric brain tumors (astrocytoma,
medulloblastoma). Image sharpening is defined as the removal
of unnecessary blurring of boundaries and shapes by
some automatic method. We define registration as the superimposition
or optimal matching of repeated pictures taken for
a single patient over time or across different imaging modalities.
A long-range goal is to develop, apply and extend
these tools to extract the most clinically useful information
from sets of serial single photon emission computed tomography
(SPECT) and magnetic resonance (MR) whole brain
scans taken both pre- and post-surgery for roughly seventy
pediatric patients per year over a period of several years.
We use two radioisotopically-labelled tracers, Thallium-201
and a technetium tracer, 99mTcHMPAO. SPECT images are
more blurred than they need to be. Corrections for uniform
photon attenuation (i.e., assuming only one medium such as
soft tissue) have been shown to have initial success. Objective
and highly-automated image registration through estimation
of pixel-by-pixel deformation maps from one image
to the next has also shown promise. The initial successes
of these two developments in other contexts indicate that objective
and highly automated methods could be developed
to yield accurate, repeatable and verifiable methods for the
extraction of useful SPECT/MR image summary measures
in biomedical longitudinal studies, and in particular for the
study of childhood brain tumors. The combination of these
two technologies could improve both the quality of serial
images and of biostatistical analyses of extracted imaging
data in general. We give some initial results on the use of
a method for sharpening SPECT images through estimation
of attenuation and scattering functions and the use of a deformable
template method to register a few SPECT slices
for the same patient at different times.


*  Address  for  correspondence:  Nicholas  Lange,  Division 
of  Biology  and  Medicine,  Box  G-A424,  Brown  University, 
Providence,  RI  02912. 


1.  Motivation  and  context 

The field of childhood brain tumors has shown substantial
areas of promise in recent years with the development of
new treatment protocols that improve outcome, both in terms
of longevity and survival (e.g., in medulloblastoma, astrocytoma).
Because of this improvement, there is a corresponding
need for more precise methods of assessing tumor growth
in order to be able to assess the response to treatment, and to
be able with confidence to distinguish the presence of tumor
from brain damage due to complications of the treatment.
Early and accurate discrimination between these two types
of post-treatment changes can be vital to the management
of the patient.

There  are  two  newly  emerging  biomedical  imaging 
technologies:  single  photon  emission  computed  tomography 
(SPECT)  and  magnetic  resonance  (MR)  imaging.  SPECT 
images  provide  data  on  internal  functional,  metabolic  events 
through  use  of  one  or  more  radioisotopically  labelled  tracers. 
MR  images,  obtained  without  such  tracers,  provide  data  on 
structural  anatomic  features. 

Among emission tomography techniques, as opposed to
transmission tomography methods such as computed tomography
(CT), single photon emission computed tomography
(SPECT) differs from positron emission tomography (PET)
in at least two important ways. In SPECT, only one photon
is released and recorded when a radioactive decay occurs,
whereas in PET two photons are propagated in opposite
directions and help to verify each other's emission
points within the brain. SPECT is thus more prone to measurement
errors. A benefit of SPECT technology, however,
is that it is much less expensive and less cumbersome to
operate, not needing a cyclotron. SPECT is the imaging
modality of choice in our clinical setting due to this advantage.
For a more complete description of SPECT, see
for example Geman and McClure (1985, 1987), Manbeck
(1990) and references therein.

Use of the radioisotopically labelled tracer Thallium-201
in SPECT provides a promising biological marker for
the extent of biologically active tumor (Kaplan et al., 1987).
The diagnostic specificity of Thallium-201 has been improved
by performing an additional scan with the technetium
tracer 99mTcHMPAO to estimate cerebral perfusion. The
demonstration of increased perfusion at the site of Thallium-201
abnormality favors active tumor.

In this report, we explore both the global sharpening of
SPECT images and an objective pixel-by-pixel registration
method for repeated images taken on the same patient.
Image sharpening is defined as the removal of unnecessary
blurring of boundaries and shapes by an automatic objective
method. Image registration is defined as the superimposition
or optimal matching of serial images taken over time or
across different imaging modalities, such as SPECT and
MR, for a single individual, assisted in some cases by other
sources including data from different individuals or, only if
absolutely required, a knowledgeable expert.

SPECT images are much more blurred than they need to
be. Many current "canned" and widely-used reconstruction
algorithms, usually of the filtered back-projection type, do
not correct for nonuniform photon attenuation and depth-dependent
scatter, and do not account for the random nature
of photon emission. Current image registration methods are
often global in nature, highly operator-assisted, rely heavily
on subjective judgements and human interactions, and yield
results that are often non-repeatable and not objectively
verifiable.

Two prospective clinical trials of radio/chemotherapy
and surgery for astrocytomas and medulloblastomas, conducted
by the Dana-Farber Cancer Institute, Boston, MA,
provide a database of test images. Imaging for the clinical
trials is performed at the Division of Nuclear Medicine,
Children's Hospital, Boston, MA. Initial whole brain scans
are obtained during pre-surgical evaluation as part of routine
examination. At most seven triplets of SPECT and MR images
at 1, 3, 6, 9, 12, 18 and 24 months are obtained during
the course of the two year treatment schedule. Four additional
sets of images for each patient are obtained, one for
each year of follow-up observation. Differing numbers of
pictures per patient arise because imaging data are missing
at certain occasions and recorded at unscheduled times, and
also due to patient death and censoring. These somewhat
irregularly spaced measurements pose no problems whatsoever
to our study. Our test database of SPECT/MR images
contains such data for 50-75 pediatric patients at present.
Figure 1 gives a diagram of the structure of images taken
for each patient at each time.

Two problems hinder extraction of accurate and objective
functional image summary measures from serial scans:
(1) SPECT images are not as sharp as they could be if corrected
for photon attenuation and scatter, and (2) there is
a lack of an objectively derived common frame of reference
within which to compare repeated images on the same
patient over time, or to compare a picture for a particular
patient at a particular time to a corresponding normal individual's
picture.

Stochastic models for repeated image summary measures
require objective, well-defined criteria and repeatable
procedures for their extraction if one is to be able to analyze
time trends and treatment effects beyond otherwise
requisite subjective and somewhat incommensurate clinical
judgements. Trends in functional image summaries, such as
may be present in repeated tumor/brain ratios and areas over
time (O'Tuama et al., 1991), can be modeled through direct
extensions of recent biostatistical methodology (see, for instance,
Laird, Lange and Stram, 1987; Lange and Ryan,
1989; Lange and Laird, 1989; Gelfand and Smith, 1990;
Lange, Carlin and Gelfand, 1991; Diggle, Lange and Beneš,
1991). In addition, the combination of functional-metabolic
(SPECT) and structural-neuroanatomic (MR) findings holds
great promise in providing more knowledge about the clinical
state of the patient than could be provided by either one
of these technologies alone (Pelizzari, Chen et al. 1989).

2.  Goals  and  tools 

Our long range goals are to obtain useful sets of objective,
verifiable and repeatable image summary measures,
to model the stochastic processes generating these measures
longitudinally over time, and to use the model results to improve
clinical interpretations of the repeated images. Our
immediate goals are to match SPECT slices for each patient
over time, to obtain artefact-free SPECT reconstructions, and
to try deformable template methods to obtain initial characterizations
of tumor changes over time. Interactions between
components of our long- and short-term goals are shown in
Figure 2.

Among the tools available for our goals are landmark
based global registration methods such as thin-plate splines
(Bookstein, 1989), principal axes transformations (Alpert et
al., 1990) and the "head and hat" method (Pelizzari, Chen et
al. 1989). Possible complements to landmark-based global
registration tools are methods for obtaining pixel-by-pixel
image mappings. These include the use of "atlases" (Mowforth
and Jin, 1988), "multi-resolution elastic matching"
(Bajcsy and Kovačič, 1989), and the related deformable template
methods (Chow, Grenander, and Keenan, 1988; Amit,
Grenander and Piccioni, 1991).

The "head and hat" registration method, used at the
Children's Hospital Medical Center, works as follows. Surface
points on slices from SPECT and MR scans for a single
patient are identified and thinned semi-automatically through
manual editing of results obtained from standard outline extraction
software. Once these external points have been identified,
one has a rough SPECT "hat" which is to be fit to the
MR "head". The fitting problem is solved by Pelizzari and
Chen, et al. (1989) as a multivariate nonlinear regression,
minimizing the sum of squared residual distances from the
"hat" to the "head" along vectors through the center of the
"head". Custom fitted "hats" are thus produced, and interior
features interpolated linearly.

Available tools for SPECT reconstructions are the
widely used filtered back-projection methods. Also available
are Bayesian reconstruction methods that use Markov
random field image models with isotropic priors (Besag,
1974; Geman and Geman, 1984; Vardi, Shepp and Kaufman,
1985; Geman and McClure, 1985, 1987). Filtered
back-projection methods can induce artifacts (boundary and
shape blurring and smearing) when the filter applied to marginal
projections does not anticipate certain asymmetries in
these projections. Corrections also need to be made for
photon attenuation and scatter effects. Weighted distance
methods, such as the “Chang algorithm” (Chang, 1978),
are widely used. Other methods estimate and correct for
SPECT machine-specific parameters (Geman, Manbeck and
McClure, 1991).

As has been described by Geman and McClure (1985,
1987) and by Geman, Manbeck and McClure (1991), photon
attenuation can be accommodated through specification of a
matrix A, a discrete attenuated Radon transform, operating
on an unobserved true image’s isotope concentration map
X to yield an expected observed image E(Y). A Bayesian
image reconstruction model typically assumes that the image
actually observed is E(Y) together with Poisson noise,
i.e., that Pr(Y|X) is Poisson with mean AX. The reconstruction
problem is to estimate X from Y while accounting
for A, that is, to find an approximation to the posterior mean
$\sum_X X \Pr(X \mid Y)$, by the method of iterated conditional
expectations (Owen, 1986) for instance.
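
The observation model in the preceding paragraph can be illustrated on a toy problem. The following is a minimal sketch, assuming a hypothetical 3 x 2 matrix A (a real discrete attenuated Radon transform is far larger and is not given in the text); the function names are illustrative only. It computes E(Y) = AX and the Poisson log-likelihood log Pr(Y|X).

```python
import math

def expected_counts(A, x):
    """E(Y) = A x for a list-of-rows matrix A and a concentration vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def poisson_loglik(y, A, x):
    """log Pr(Y = y | X = x) when the Y_i are independent Poisson with mean (A x)_i."""
    mean = expected_counts(A, x)
    return sum(y_i * math.log(m_i) - m_i - math.lgamma(y_i + 1)
               for y_i, m_i in zip(y, mean))

# Hypothetical toy system: 3 detector bins, 2 pixels (values are made up).
A = [[0.8, 0.1],
     [0.3, 0.3],
     [0.1, 0.8]]
x_true = [10.0, 20.0]
y_obs = [10, 9, 17]   # counts equal to A x_true, i.e. a noise-free observation

loglik_true = poisson_loglik(y_obs, A, x_true)
loglik_swapped = poisson_loglik(y_obs, A, [20.0, 10.0])
```

In this toy case the concentration map that generated the counts has the higher likelihood of the two candidates. A Bayesian reconstruction would combine such a likelihood with a Markov random field prior on X, which is not sketched here.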

Two methods for estimating the non-uniform attenuation
suggest themselves: to use a CT scan that measures
attenuation directly and/or estimate attenuation from an MR
scan, or to estimate attenuation directly from the raw SPECT
data. We choose the latter approach, a cleaner albeit
scientifically more challenging solution. As described by Geman,
Manbeck and McClure (1991), differences between observed
and actual photon counts arise from three main sources:
collimator effect, scatter fraction and attenuation. Collimator
effect arises from “stray” photons being recorded in collimators
other than those directly in line with their original
trajectories. Scatter fraction is the proportion of “stray” photons
among the total number detected that account for a
depth-dependent blurring of the image. Attenuation is the process
by which some emitted photons are not detected, due to
insufficient energy to complete their paths to collimator(s)
through differing media such as bone and soft tissue. We
describe the attenuation correction method in some detail in
the next section.

3.  SPECT  machine-specific  parameter 
estimation 

Correction for attenuation and scatter effects can be
accommodated by estimating the discrete attenuated Radon
transform A. This requires estimation of a line spread function
$g_s$ due to scattered indirect photon counts. This function
is not observed directly, and is modeled as a weighted
difference between observed line spread functions $g_0$ for the
ambient medium (air) and $g_m$ for denser media such as soft
tissue (m = 1) or bone (m = 2). Figure 3 gives an illustrative
diagram. Weights $w_0$ and $w_s$ enter in the estimation of
$g_s$ as functions of the number of observed photons and the
unobserved proportion of these photons due to scatter (the
“scatter fraction”). Both of these weights depend on the
distances of events to the gamma camera. The line spread
functions $g_0$ and $g_s$ are modeled under suitable parametric
assumptions (e.g. double exponential, Gaussian). The line
spread functions do not need to be assumed members of any
parametric family of curves, however. The standard deviations
$\sigma_0$ and $\sigma_s$ are modeled as linear functions of distance.
These functions are used to obtain interpolated values of
line spread functions at all distances. The scatter fraction
has been shown to be as high as 30-40% of the total at certain
sites, for instance in regions of 8-10 cm internal depth
(Penny, et al., 1990; Geman et al., 1991). It is this high
fraction of scattered photons that seems to account for much
of the internal blurring of SPECT images, although much
is still unknown about this property in general biomedical
contexts.


Figure  3. 


Estimation for uniform attenuation and scatter requires
performing two phantom experiments, one through the ambient
medium alone (air) and one through medium m = 1
alone (which we chose as water). Let B denote the number
of detector bins “off” of the center as the random variable
of interest, D the distance through the attenuating medium,
and d (= D + constant) the total distance from
the point source to the gamma camera. Geman, Manbeck
and McClure (1991) model the line spread functions as

$$ g_1(b \mid d, D) = w_0(D)\, g_0(b \mid d) + w_s(D)\, g_s(b \mid D) $$

with

$$ g_0(b \mid d) \sim N(b, \sigma_0^2(d)) $$

$$ g_s(b \mid D) \sim N(b, \sigma_s^2(D)) $$

and

$$ w_0(D) = \text{known} $$

$$ w_s(D) = \text{‘scatter fraction’} = \gamma_2 e^{-\gamma_1 D} - e^{-cD}, \qquad \gamma_1, \gamma_2 \ \text{unknown}. $$

The constant c depends on the attenuating medium as well
as on the tracer used, and can be obtained from known,
available sources.
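
The two-component line spread model can be sketched numerically. This is a minimal illustration rather than the authors' implementation: the Gaussian ridges are taken to be centred at bin offset zero, the "known" direct weight is assumed here to be exp(-cD), and the numerical values are the Thallium-201 entries of Table 1, used only as plausible inputs. All function names are hypothetical.

```python
import math

# Thallium-201 values as they appear in Table 1 (illustrative use only).
SD_AIR = (1.49, 0.005)        # intercept and slope of sigma_0(d)
SD_SCATTER = (2.53, 0.037)    # intercept and slope of sigma_s(D)
GAMMA1, GAMMA2, C = 0.095, 1.15, 0.194

def gaussian(b, sd):
    """Zero-centred Gaussian density evaluated at bin offset b (an assumption)."""
    return math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def scatter_weight(D):
    """w_s(D) = gamma_2 exp(-gamma_1 D) - exp(-c D)."""
    return GAMMA2 * math.exp(-GAMMA1 * D) - math.exp(-C * D)

def direct_weight(D):
    """ASSUMPTION: the 'known' weight w_0(D) is taken to be exp(-c D)."""
    return math.exp(-C * D)

def line_spread(b, d, D):
    """w_0(D) g_0(b|d) + w_s(D) g_s(b|D), with standard deviations linear in distance."""
    sd0 = SD_AIR[0] + SD_AIR[1] * d
    sds = SD_SCATTER[0] + SD_SCATTER[1] * D
    return direct_weight(D) * gaussian(b, sd0) + scatter_weight(D) * gaussian(b, sds)

# Example: profile across detector bins at total distance 30 cm, 10 cm of water.
profile = [line_spread(b, 30.0, 10.0) for b in range(-5, 6)]
```

The resulting profile is symmetric about the centre bin and has a broader tail than the air-only Gaussian, which is the depth-dependent blurring the text attributes to scatter.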

Thus the line spread functions are assumed to be Gaussian
ridges with depth-dependent variances. The estimated
depth-dependent standard deviations are set equal to their
expectations, in standard method of moments fashion, and
assumed to be linear functions of distance, i.e.

$$ E(\hat\sigma_0) = \sigma_0 = \alpha_0 + \beta_0 d \qquad (1) $$

A generalization of this approach, if the problem’s complexity
required, would be to use the method of estimating
functions (Godambe, 1960) or the related generalized
estimating equations method (Liang and Zeger, 1986). In our
approach, the coefficients in (1) are estimated by the method
of ordinary least squares. The task is then to determine
which regions in a particular SPECT image correspond to
the different media. This can be done either by estimating an
additional unknown vector of pixel labels, greatly increasing
the dimensionality of the problem, or through labelling each
pixel using a map derived from a concurrent MR scan and
a working registration of the two images by matching
suspected skull boundaries. A more automatic method for this
second approach would be to use global ellipses (e.g. Alpert
et al., 1990) as approximations to skull boundaries, or to
represent irregular boundaries by a modified Fourier series
(e.g. Zahn and Roskies, 1972). In our present case, we
labeled pixels in different regions by using crude ellipses, the
approximate shape of the hot ring of the scalp.
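
The ordinary least squares step for the coefficients in (1) can be sketched as follows. This is a minimal illustration under stated assumptions: the distances and standard deviations below are made-up values lying exactly on a line with intercept 1.49 and slope .005, and `ols_line` is a hypothetical helper name, not a routine from the paper.

```python
def ols_line(dist, sd):
    """Ordinary least squares fit of sd = alpha + beta * dist; returns (alpha, beta)."""
    n = len(dist)
    mean_d = sum(dist) / n
    mean_s = sum(sd) / n
    sxx = sum((d - mean_d) ** 2 for d in dist)
    sxy = sum((d - mean_d) * (s - mean_s) for d, s in zip(dist, sd))
    beta = sxy / sxx
    alpha = mean_s - beta * mean_d
    return alpha, beta

# Hypothetical phantom measurements: distance (cm) and estimated standard deviation.
distances = [5.0, 10.0, 15.0, 20.0]
sd_estimates = [1.49 + 0.005 * d for d in distances]

alpha_hat, beta_hat = ols_line(distances, sd_estimates)

def sd_at(distance):
    """Interpolated standard deviation at an arbitrary distance."""
    return alpha_hat + beta_hat * distance
```

Once the intercept and slope are estimated, `sd_at` gives the interpolated standard deviation of the line spread function at any distance, as the text describes.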

4.  Serial  SPECT  registration  by 
deformable  templates 

We have chosen to focus our efforts on finding common
and useful frames of reference for serial SPECT scans
by further development of the pixel-by-pixel method of
registration by deformable templates (Amit et al., 1991).
This is a local method by which one obtains a deformation
map that connects each pixel in one picture to its mate in
another picture through minimization of a global
goodness-of-fit criterion, while maintaining smoothness constraints in
some cases.

Table 1. Results of the phantom experiments. Estimates are reported in centimeters.

                          Thallium-201                  99mTc-HMPAO
Standard deviation        intercept (α)   slope (β)     intercept (α)   slope (β)
air (σ_0)                 1.49            .005          1.46            .033
scatter (σ_s)             2.53            .037          2.33            .260

scatter fraction
(γ_1, γ_2, c) in cm^-1    (.095, 1.15, .194)            (.012, 1.0, .150)

Denoting each pixel location by coordinates x, the
method of deformable templates assumes that a SPECT image
$I_t$ for a particular patient at a particular time t over
a domain S is a deformation of an earlier image $I_{t'}$, the
template, taken for this same patient. One determines the
deformation map of $I_{t'}$ into $I_t$ by finding values of
coefficients $\xi_1, \ldots, \xi_P$ for which the integrated squared distance
between images,

$$ \int_{x \in S} \Big[ I_t(x) - I_{t'}\Big( x + \sum_{p=1}^{P} \xi_p \psi_p(x) \Big) \Big]^2 dx, \qquad t' < t, \qquad (2) $$

is a minimum. In (2), the functions $\psi_1, \ldots, \psi_P$ are orthonormal
basis functions such as the Fourier basis or the “wavelet”
basis (e.g. Mallat, 1987). Minimization of (2) is done by
gradient descent. As discussed by Amit et al. (1991), the prior
distribution on the set of possible mappings from $I_{t'}$ into
$I_t$ is taken as multivariate Gaussian and concentrated near
the identity map, where $\sum_{p=1}^{P} \xi_p \psi_p(x) = 0$. If desired, one
may include a regularization term in (2) that penalizes
non-smooth mappings. Note that no landmarks are required by
this method, making it much less operator-assisted and
subjective, and more automated and objectively verifiable than
many existing registration methods. It is not yet known to
what extent the deformable template method can be
complemented by landmark-based methods.
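
The minimization of criterion (2) by gradient descent can be illustrated in one dimension. The sketch below is a toy example under stated assumptions, not the authors' code: the template is an analytic bump (so it can be evaluated at displaced coordinates without interpolation), the basis consists of two orthonormal sine functions on [0, 1], the step size is fixed, and no regularization term or prior is included. All names are hypothetical.

```python
import math

S = [i / 100.0 for i in range(101)]   # grid on the domain S = [0, 1]
DX = 0.01
WIDTH = 0.1

def template(x):
    """Earlier image I_t': a smooth 1-D bump standing in for an image profile."""
    return math.exp(-0.5 * ((x - 0.5) / WIDTH) ** 2)

def template_grad(x):
    """Derivative of the template with respect to x."""
    return -(x - 0.5) / WIDTH ** 2 * template(x)

# Two orthonormal sine basis functions psi_1, psi_2 on [0, 1].
BASIS = [lambda x: math.sqrt(2.0) * math.sin(math.pi * x),
         lambda x: math.sqrt(2.0) * math.sin(2.0 * math.pi * x)]

def displaced(x, xi):
    """x + sum_p xi_p psi_p(x)."""
    return x + sum(c * psi(x) for c, psi in zip(xi, BASIS))

def criterion(xi, later):
    """Discretized integrated squared distance between I_t and the deformed template."""
    return sum((later[i] - template(displaced(x, xi))) ** 2
               for i, x in enumerate(S)) * DX

def gradient(xi, later):
    """Gradient of the criterion with respect to the coefficients xi."""
    grad = [0.0] * len(xi)
    for i, x in enumerate(S):
        u = displaced(x, xi)
        resid = later[i] - template(u)
        for p, psi in enumerate(BASIS):
            grad[p] += -2.0 * resid * template_grad(u) * psi(x) * DX
    return grad

def register(later, step=0.01, n_iter=1000):
    """Plain gradient descent started at the identity map (xi = 0)."""
    xi = [0.0] * len(BASIS)
    for _ in range(n_iter):
        g = gradient(xi, later)
        xi = [c - step * gc for c, gc in zip(xi, g)]
    return xi

XI_TRUE = [0.03, -0.02]   # made-up deformation used to synthesize the later image
later_image = [template(displaced(x, XI_TRUE)) for x in S]
xi_hat = register(later_image)
```

Starting at xi = 0 corresponds to starting at the identity map, near which the prior described above is concentrated. For real images the deformed template must be evaluated by interpolation between pixels, and the step size would need tuning.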


5.  Some  initial  results 

Figure 4 shows transverse Thallium-201 SPECT slices for
a single patient at two different times, both post-surgery,
spaced about two months apart. Reconstruction was by
commercially available filtered back-projection with Chang
attenuation correction. The images are arranged in three
rows and two columns. The rows proceed from about ear
level toward the top of the head, and are at roughly the
same level across columns. The first column is for the first
scan, the second column for the subsequent scan. The hot
ring in each is due to the uptake of Thallium-201 in the
scalp. The hot spots in areas interior to the ring indicate
active tumor. Although one may be able, by eye, to
infer that the tumor has grown from one occasion to the
next, quantification of such suspected changes in tumor size
(e.g. edge, area and/or volume), as well as quantification
of suspected changes in tumor shape, would be affected
strongly by artifacts induced by the filtered back-projection
reconstructions, and would not be highly reliable.


5.1.  Sharpening 

SPECT machine-specific parameter estimates from the phantom
experiments are shown in Table 1. Note the higher
variability and scatter fraction in the weaker Thallium-201
scans.

Figure 5 shows the result of applying the Bayesian
image reconstruction model described in §2 with an A
matrix that accommodates corrections for attenuation and
scatter, obtained from the phantom experiment results shown
in Table 1. Reconstruction artifacts appear greatly reduced
and areas of tumor activity more localized.


5.2.  Registration 

Figure 6 shows an application of the deformable template
method described in §4 to some filtered back-projection
reconstructions (not included in Figure 4). It would have been
preferable to apply this method to the sharper reconstructions,
but this was not possible by the time of this writing.
The upper lefthand frame of Figure 6 is the “template” $I_{t'}$.
The lower righthand frame is the observed deformation $I_t$
later in time. The upper righthand frame is the pixel-by-pixel
difference between the observed images. The lower
lefthand frame is the estimated deformation of $I_{t'}$ into $I_t$.
The map to the right gives a vector for each pixel
indicating the direction and distance each has moved from the
template image to its deformation. Note a general outward
movement from a suspected tumor center. However,
artifacts in the filtered back-projection reconstructions seem
to preclude reliable clinical interpretation of this estimated
deformation map.

6. Summary

We have found in our initial experiments that for our problem
the use of several imaging modalities is essential in
order to obtain results and image interpretations that are
clinically reliable. We have found in addition that corrections
for rigid-body motions (translation, rotation, scale)
when comparing different scans are also mandatory. When
a goal is to obtain reliable, verifiable and repeatable SPECT
image summary measures, we question the use of filtered
back-projection reconstructions of the low-energy Thallium-201
scans for pediatric brain tumors. Corrections for
non-uniform photon attenuation and scatter are essential. We
have demonstrated that objective, Bayesian image restoration
methods can yield results that are relatively artefact-free.
More work needs to be done with the application of
the deformable template method in our context, in particular
with the sharpened images. Objective and verifiable
pixel-by-pixel characterizations of tumor changes over time do
appear feasible, however. External, historical atlases may
help in normal tissue typing and exclusion tasks. One of
the next steps in our research will be to try out some
semi-automatic edge extraction methods on the sharpened SPECT
images, such as the graduated non-convexity algorithm
proposed by Blake and Zisserman (1987), which is programmed
and available.

Figure 4. Filtered back-projection reconstructions of transverse Thallium-201 SPECT slices for a single pediatric patient. Rows proceed
from about midsection of the brain toward the top of the head. The first column is at the first scan, the second column at the second
scan about two months later. The rings are due to tracer uptake in the scalp. Uptake areas interior to the rings indicate active tumor.

Figure 5. Several filtered back-projection reconstructions (above), with their Bayesian, Markov random
field reconstructions with corrections for attenuation and scatter from the phantom experiments (below).

Figure 6. Results of an initial pixel-by-pixel registration of selected regions of filtered back-projection scans
by the deformable template method. Frames to the left: upper lefthand frame: the “template” $I_{t'}$; lower righthand frame:
the observed deformation $I_t$ later in time; upper righthand frame: the pixel-wise difference between the observed images;
lower lefthand frame: the estimated deformation of $I_{t'}$ into $I_t$. The estimated pixel-by-pixel map is shown on the right.

Abstract

We discuss several models for shapes in the plane based
on the distributions of landmarks about an underlying
template. The motivation for these models includes
Markov random fields and thin plate splines. These models
are used as priors in a Bayesian framework to reconstruct
a shape from a digital image. An example is given
based on the human hand.

1  Introduction 

In this paper we shall discuss methods to pick out a
shape from a two-dimensional digital image. The shape
is assumed to be a deformation of some underlying shape
or ‘template’, and the image is also subject to observational
noise. We represent points in the plane as complex
numbers. We shall focus attention on the case where the
shape can be described as a simply connected domain
$D \subset \mathbb{C}$ whose boundary consists of a piecewise linear path
connecting vertices $z_0, z_1, \ldots, z_n \in \mathbb{C}$ with $z_0 = z_n$. Let
$V = \{z_j\}$, termed the ‘outline’ of the shape, denote the
set of vertices. Similarly, let $V_0 = \{\mu_j\}$ say, denote the
outline of the underlying template.

The deformation from $V_0$ to $V$ consists of two types of
transformations. The first type of transformation consists
of global linear changes such as (a) location, (b)
scale, (c) rotation and possibly (d) a more general linear
transformation of the plane. The second type of
transformation consists of local changes to the outline. In
this paper we shall discuss various probability models
for the local change to the outline (including the location
change). Thus we will get a probability distribution
$P(V)$ on the outline of our shape, centred at the underlying
template $V_0$. Some possible models are given in
Sections 2-3.

For other aspects of the deformation, such as scale
and rotation changes, we shall use ad hoc fitting procedures.
An example involving the reconstruction of a
hand is given in Section 4. Thus our approach to modelling
shapes differs from other approaches (e.g. Goodall,
1991, Bookstein, 1986, Kendall, 1984, Kent, 1991, Mardia
& Dryden, 1989) in which location, scale and rotation
effects are incorporated directly into the models.

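The observed digital image includes information about
the given shape, together with observational errors. One
possible model is

$$ y_\ell = \nu_1 + \epsilon_\ell \quad \text{if } \ell \in D $$

$$ y_\ell = \nu_2 + \epsilon_\ell \quad \text{if } \ell \notin D \qquad (1.1) $$

where $y_\ell \in \mathbb{R}$ denotes the ‘grey-level’ in the $\ell$th pixel
and $\ell = (\ell_1, \ell_2)$ labels the pixels in an $L_1 \times L_2$ grid. In
the simplest version of the model we suppose the $\epsilon_\ell$ are
independent $N(0, \sigma^2)$ random variables. The mean levels
$\nu_1$ and $\nu_2$ indicate the difference between the shape and
the background. Thus, given $V$ the model for $y = \{y_\ell\}$
has pdf

$$ P(y \mid V) \propto \exp\Big\{ -\frac{1}{2\sigma^2} \Big[ \sum_{\ell \in D} (y_\ell - \nu_1)^2 + \sum_{\ell \notin D} (y_\ell - \nu_2)^2 \Big] \Big\}. \qquad (1.2) $$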

If we let $P(V)$ denote the prior probability density of
$V$ under a deformable template model then by Bayes
theorem

$$ P(y \mid V)\, P(V) \qquad (1.3) $$

is proportional to the posterior density of $V$ given the
data. We seek an estimate of $V$ to maximise (1.3). This
estimate is known as the MAP or ‘maximum a posteriori’
estimate.
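
The posterior (1.3) can be evaluated directly for a candidate outline. The following is a minimal sketch under stated assumptions: pixels are assigned to D by testing their centres with a standard ray-casting rule, the prior is supplied by the caller as a log-density (flat by default), and the 6 x 6 noise-free image is made up. The function names are illustrative only.

```python
def inside_polygon(point, vertices):
    """Ray-casting test; point and vertices are complex numbers, as in the text."""
    x, y = point.real, point.imag
    inside = False
    n = len(vertices)
    for j in range(n):
        a, b = vertices[j], vertices[(j + 1) % n]
        if (a.imag > y) != (b.imag > y):
            x_cross = a.real + (y - a.imag) * (b.real - a.real) / (b.imag - a.imag)
            if x < x_cross:
                inside = not inside
    return inside

def log_posterior(y, vertices, nu1, nu2, sigma2, log_prior=0.0):
    """log of P(y|V) P(V) up to an additive constant, with P(y|V) as in (1.2)."""
    total = 0.0
    for (l1, l2), grey in y.items():
        centre = complex(l1 + 0.5, l2 + 0.5)   # ASSUMPTION: pixel centres decide membership
        mean = nu1 if inside_polygon(centre, vertices) else nu2
        total -= (grey - mean) ** 2 / (2.0 * sigma2)
    return total + log_prior

# Made-up example: a noise-free 6 x 6 image of a square with grey levels 1 and 0.
square = [1 + 1j, 4 + 1j, 4 + 4j, 1 + 4j]
shifted_square = [2 + 2j, 5 + 2j, 5 + 5j, 2 + 5j]
image = {(l1, l2): (1.0 if inside_polygon(complex(l1 + 0.5, l2 + 0.5), square) else 0.0)
         for l1 in range(6) for l2 in range(6)}

lp_true = log_posterior(image, square, 1.0, 0.0, 0.25)
lp_shifted = log_posterior(image, shifted_square, 1.0, 0.0, 0.25)
```

With a flat prior the MAP estimate among these two candidates is the outline with the larger value, here the square that generated the image. A deformable template prior would enter through `log_prior`.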

The main focus in this paper is on suitable models for
$P(V)$ which we explore in Sections 2-3. To some extent
our paper is a review of models proposed by previous
authors, but we also bring out some unifying themes
behind the models together with some new results.

2 Complex normal models on outlines


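Suppose $(\mathrm{Re}\, z_1, \mathrm{Im}\, z_1, \ldots, \mathrm{Re}\, z_n, \mathrm{Im}\, z_n)' = r$ say,
follows a 2n-dimensional (real) normal distribution with
mean $(\mathrm{Re}\, \mu_1, \mathrm{Im}\, \mu_1, \ldots, \mathrm{Re}\, \mu_n, \mathrm{Im}\, \mu_n)'$ and $2n \times 2n$
covariance matrix $\Omega$ say. Typically $\Omega$ will be small
so that the distribution of the observed set of landmarks
$V = \{z_j\}$ will not be too far from the template
$V_0 = \{\mu_j\}$. For simplicity we shall suppose that $\Omega$
possesses complex symmetry. That is if we write $\Omega = (\Omega_{jk})$
in terms of $2 \times 2$ blocks $\Omega_{jk}$, $j, k = 1, \ldots, n$, then

$$ \Omega_{jk} = \sigma_{jk} \begin{pmatrix} \cos\theta_{jk} & -\sin\theta_{jk} \\ \sin\theta_{jk} & \cos\theta_{jk} \end{pmatrix} $$

for some $\sigma_{jk} = \sigma_{kj} > 0$ and angle $\theta_{jk} = -\theta_{kj} \in (0, 2\pi)$.
In particular $\theta_{jj} = 0$. We can also represent $\Omega$ as an
$n \times n$ complex matrix $\Sigma$ with entries $\sigma_{jk} \exp(i\theta_{jk})$. Then
$r'\Omega r = z^*\Sigma z$ where $r'$ denotes the transpose of $r$ and
$z^* = \bar{z}'$ denotes the transpose of the complex conjugate
of $z$.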

Complex symmetry is often too restrictive an assumption
to lead to good models for outline data (see e.g. the
figures in Goodall, 1991; Dryden & Mardia, 1991). However,
since it may not be essential to specify the prior distribution
$P(V)$ very precisely, the assumptions of complex
symmetry may be adequate. In any case the models
here can be generalised to the non-complex-symmetric
case at the expense of extra notation and additional
parameters.

The simplest general model for the vertices
$z = (z_1, \ldots, z_n)'$ about $\mu = (\mu_1, \ldots, \mu_n)'$ is a multivariate
complex normal model

$$ f(z) \propto \exp\{ -\tfrac{1}{2} (z - \mu)^* A (z - \mu) \}, $$

where the inverse covariance matrix $A$ is Hermitian. It
is simplest to suppose that $A$ is positive semi-definite of
rank $n - 1$, with $A\mathbf{1} = \mathbf{0}$ so that the distribution of $z$ is
improper. Here $\mathbf{1} = (1, 1, \ldots, 1)'$ and $\mathbf{0} = (0, 0, \ldots, 0)'$.
Thus $f(z) = f(z + a\mathbf{1})$ for any $a \in \mathbb{C}$. The reason for this
choice is that we are not usually interested in location
differences when judging the similarity of a given outline
$z$ to the template $\mu$.

Without  further  restriction  the  matrix  A  contains  too 
many  parameters  to  represent  a  useful  model.  Therefore 
it  is  of  interest  to  look  at  some  special  cases. 


2.1  The  vertex  CAR  model 

Following Besag (1974) the simplest model for the vertices
is a first-order conditional autoregressive (CAR)
model. Equivalently we require $A$ to be cyclic tridiagonal.
The conditional distribution of $z_j$ given the rest of
the points $\{z_k : k \neq j\}$ is complex normal with first two
moments

$$ E[z_j \mid \text{rest}] = \alpha_j z_{j-1} + \beta_j z_{j+1} $$

$$ \mathrm{var}[z_j \mid \text{rest}] = \tau_j^2 \qquad (2.1) $$

say, where $\alpha_j, \beta_j \in \mathbb{C}$, $\alpha_j/\tau_j^2 = \bar\beta_{j-1}/\tau_{j-1}^2$ and
$\alpha_j + \beta_j = 1$, $j = 1, \ldots, n$. In terms of the elements
of $A$,

$$ a_{jj} = 1/\tau_j^2, \qquad a_{j,j-1} = -\alpha_j/\tau_j^2, \qquad a_{j,j+1} = -\beta_j/\tau_j^2. \qquad (2.2) $$

Here and elsewhere we interpret the subscripts mod n.
Remember that the parameters must be chosen so that
$A$ is positive semi-definite of rank $n - 1$.

The simplest version of this model is obtained when
$\tau_j^2 = \tau^2$ say, does not depend on $j$ and $\alpha_j = \beta_j = 1/2$
for all $j$.
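
The simplest vertex CAR model can be checked numerically. The sketch below builds the cyclic tridiagonal matrix A with diagonal 1/τ² and off-diagonal entries -1/(2τ²), then verifies that its rows sum to zero, that the quadratic form is unchanged when the outline is translated, and that exactly one eigenvalue vanishes. The template (a regular hexagon), the perturbation and the value of τ² are made-up inputs, and the function names are hypothetical.

```python
import cmath
import math

def car_precision(n, tau2):
    """Cyclic tridiagonal A with a_jj = 1/tau2 and a_{j,j-1} = a_{j,j+1} = -1/(2 tau2)."""
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        A[j][j] += 1.0 / tau2
        A[j][(j - 1) % n] -= 0.5 / tau2   # alpha_j = 1/2
        A[j][(j + 1) % n] -= 0.5 / tau2   # beta_j = 1/2
    return A

def quad_form(A, z, mu):
    """(z - mu)* A (z - mu); real because this A is real and symmetric."""
    d = [zj - mj for zj, mj in zip(z, mu)]
    n = len(d)
    return sum(d[j].conjugate() * A[j][k] * d[k]
               for j in range(n) for k in range(n)).real

def car_eigenvalues(n, tau2):
    """Eigenvalues of this circulant A: (1 - cos(2 pi k / n)) / tau2, k = 0..n-1."""
    return [(1.0 - math.cos(2.0 * math.pi * k / n)) / tau2 for k in range(n)]

n, tau2 = 6, 0.01
mu = [cmath.exp(2j * math.pi * j / n) for j in range(n)]       # hexagonal template
z = [m + 0.05 * cmath.exp(1j * j) for j, m in enumerate(mu)]   # perturbed outline
A = car_precision(n, tau2)
shifted = [zj + (3.0 - 2.0j) for zj in z]                      # same outline, translated

q_original = quad_form(A, z, mu)
q_shifted = quad_form(A, shifted, mu)
rank = sum(1 for lam in car_eigenvalues(n, tau2) if lam > 1e-9)
```

The two quadratic forms agree up to rounding, which is the translation invariance f(z) = f(z + a1) stated in Section 2, and the count of non-zero eigenvalues equals n - 1.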

2.2  The  CAR  transformation  model  on 
edges 

Let $e_j = z_j - z_{j-1}$, $\eta_j = \mu_j - \mu_{j-1}$, $j = 1, \ldots, n$,
denote the edges between successive vertices in the random
outlines and the template, respectively. Note that
$\sum e_j = \sum \eta_j = 0$. Write

$$ e_j = (1 + t_j)\,\eta_j, \qquad t_j \in \mathbb{C}. \qquad (2.3) $$

Then $t_j$ measures the extent to which $e_j$ differs from template
edge $\eta_j$. Chow et al (1988) proposed a conditional
cyclic-stationary first-order CAR model for $\{t_j\}$,
conditioning on $\sum t_j \eta_j = 0$.

In its unconditional form the CAR model for the $\{t_j\}$
can be written in the form

$$ E[t_j \mid \text{rest}] = -(\delta/\beta)\, t_{j-1} - (\bar\delta/\beta)\, t_{j+1}, $$

$$ \mathrm{var}[t_j \mid \text{rest}] = 1/\beta, \qquad (2.4) $$

where $\beta > 0$ and $\delta \in \mathbb{C}$. Thus the (unconditional) pdf of
$\{t_j\}$ is proportional to

$$ \exp\Big\{ -\tfrac{1}{2} \Big[ \beta \sum |t_j|^2 + 2\,\mathrm{Re}\Big( \delta \sum t_j \bar{t}_{j+1} \Big) \Big] \Big\}. \qquad (2.5) $$

The sums here range over $j = 1, \ldots, n$ and subscripts are
to be interpreted mod n. A sufficient condition to ensure
that the covariance matrix of the $\{t_j\}$ is positive definite
is $|\delta|/\beta < 1/2$. After conditioning, the distribution of
$(t_1, \ldots, t_n)$ is no longer a CAR, though it is still complex
normal.

A multivariate complex normal distribution on
$t_1, \ldots, t_n$ induces a multivariate normal distribution on
the edges $\{e_j\}$. Further if we allow the location
of the vertices $z_1, \ldots, z_n$ (as measured by the centroid,
say) to have an improper uniform distribution over $\mathbb{C}$
(note the location of the outline is not determined by
the edges), then we can transform the above distribution
on edges to give an improper multivariate complex
normal distribution on the vertices $z_1, \ldots, z_n$.

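Write $\omega_j = z_j - \mu_j$. After a little algebra it follows that
the distribution of $(z_1, \ldots, z_n)$ is an improper second-order
CAR with

$$ E[\omega_j \mid \text{rest}] = \tau_j^2 \Big\{ \beta \big( |\eta_j|^{-2} \omega_{j-1} + |\eta_{j+1}|^{-2} \omega_{j+1} \big)
 - \delta\, \bar\eta_{j-1} \eta_j\, |\eta_{j-1}|^{-2} |\eta_j|^{-2} (\omega_{j-1} - \omega_{j-2}) $$
$$ \qquad + \bar\delta\, \eta_{j+1} \bar\eta_{j+2}\, |\eta_{j+1}|^{-2} |\eta_{j+2}|^{-2} (\omega_{j+2} - \omega_{j+1})
 - |\eta_j|^{-2} |\eta_{j+1}|^{-2} \big( \delta\, \bar\eta_j \eta_{j+1}\, \omega_{j-1} + \bar\delta\, \eta_j \bar\eta_{j+1}\, \omega_{j+1} \big) \Big\}, \qquad (2.6) $$

$$ \mathrm{var}[\omega_j \mid \text{rest}] = \tau_j^2, $$

where

$$ \tau_j^2 = \Big\{ \beta \big( |\eta_j|^{-2} + |\eta_{j+1}|^{-2} \big) - |\eta_j|^{-2} |\eta_{j+1}|^{-2} \big( \delta\, \bar\eta_j \eta_{j+1} + \bar\delta\, \eta_j \bar\eta_{j+1} \big) \Big\}^{-1}. \qquad (2.7) $$

An important special case occurs when the template
vertices form a regular polygon, i.e. $\mu_j = \exp(2\pi i j/n)$.
In this case

$$ E[\omega_j \mid \text{rest}] = \tau^2 |a - 1|^{-2} \big\{ (\beta - 2\delta a)\, \omega_{j-1} + (\beta - 2\bar\delta \bar{a})\, \omega_{j+1} + \delta a\, \omega_{j-2} + \bar\delta \bar{a}\, \omega_{j+2} \big\}, \qquad (2.8) $$

$$ \mathrm{var}[\omega_j \mid \text{rest}] = \tau^2, $$

where

$$ \tau^2 = \tfrac{1}{2} |a - 1|^2 (\beta - \mathrm{Re}\, \delta a)^{-1}, \quad \text{and} \quad a = \exp(2\pi i/n). \qquad (2.9) $$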

2.3  A  Covariance  Model 

The above models are useful when landmarks can be
consistently identified on the template and the observed
outline. However, in some examples, e.g. an outline of
a biological cell, there are no identifiable features and
the n landmarks might be defined to be equally-spaced
(in terms of arc length) around the outline of the object.
In this case it is reasonable to take the template to be
a regular n-sided polygon, with $\mu_j = \exp(2\pi i j/n)$, and
to model the variety of possible shapes using a circulant
Toeplitz covariance matrix, as proposed by Miller
et al (1991). Defining $t_1, \ldots, t_n$ as in (2.3), they model
$(t_1, \ldots, t_n)$ as a multivariate complex normal distribution
with circulant Toeplitz covariance matrix $B$, say,
conditional on $\sum t_j \eta_j = 0$. That is, $B = (b_{jk})$ has entries
$b_{jk} = \alpha_{j-k}$, say, where $\alpha_{j-k} = \bar\alpha_{k-j}$. Here as
elsewhere subscripts are to be interpreted mod n. The
eigenvectors of $B$ are

$$ g_k = n^{-1/2} \big[ \exp(-2\pi i jk/n),\ j = 1, \ldots, n \big]' $$

for $k = 1, \ldots, n$ with eigenvalues

$$ \lambda_k = \sum_{j=1}^{n} \alpha_j \exp(-2\pi i jk/n), $$

$k = 1, \ldots, n$. The eigenvalues, assumed to be non-negative,
are not necessarily in any monotone order.

Let $G$ be the $(n \times n)$ unitary matrix with columns $g_k$,
and set

$$ s = G^* t $$

to be the vector of principal components. The constraint
$\sum t_j \eta_j = 0$ takes an appealing form in terms of principal
components,

$$ \sum t_j \eta_j = [1 - e^{-2\pi i/n}] \sum t_j e^{2\pi i j/n} = \sqrt{n}\, [1 - e^{-2\pi i/n}]\, s_1 = 0, $$

that is, $s_1 = 0$. Miller et al (1991) suggest estimating
the parameters in $B$ by using a training sample of m
outlines. Equivalently, after rotating the principal components
for each outline, one can estimate the eigenvalue
$\lambda_k$ by $(2m)^{-1}$ times the sum of squared absolute values
of the kth principal component in the training sample.
Further, since each outline in the training sample will
satisfy the constraint $s_1 = 0$, we will always estimate
$\lambda_1 = 0$.

Miller et al (1991) also suggest a modification to
this model in which the real and imaginary parts of
$(t_1, \ldots, t_n)$ are modelled independently using separate
circulant Toeplitz matrices (with real entries). However
this modification lacks the appealing rotational invariance
of the original model. Note that the CAR in (2.8)
and (2.9) is a special case of this model.
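
The spectral structure of a circulant Toeplitz covariance matrix can be verified on a small example. The sketch below is illustrative only: the entries α_j are made-up values chosen so that B is Hermitian with positive eigenvalues, indices run from 0 to n-1 (subscripts mod n), and the helper names are hypothetical. Each eigenvalue is obtained as g_k* B g_k; which Fourier sum pairs with which g_k depends on the sign convention used for the indices, so the comparison with the Fourier sums of the α_j is made on sorted lists.

```python
import cmath
import math

def circulant(alpha):
    """B with entries b_jk = alpha[(j - k) mod n]."""
    n = len(alpha)
    return [[alpha[(j - k) % n] for k in range(n)] for j in range(n)]

def fourier_vector(n, k):
    """g_k = n^{-1/2} [exp(-2 pi i j k / n), j = 0..n-1]."""
    return [cmath.exp(-2j * math.pi * j * k / n) / math.sqrt(n) for j in range(n)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def inner(u, v):
    """u* v (conjugate-linear in the first argument)."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

# Made-up Hermitian circulant: alpha[n - m] is the conjugate of alpha[m].
alpha = [2.0, 0.5 + 0.2j, 0.1 - 0.1j, 0.1 + 0.1j, 0.5 - 0.2j]
n = len(alpha)
B = circulant(alpha)

eigen = []   # (eigenvalue, size of the residual B g_k - lambda g_k)
for k in range(n):
    g = fourier_vector(n, k)
    Bg = matvec(B, g)
    lam = inner(g, Bg)
    residual = max(abs(x - lam * y) for x, y in zip(Bg, g))
    eigen.append((lam, residual))

fourier_sums = [sum(alpha[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
                for k in range(n)]

# Closure constraint for a regular polygon template: sum t_j eta_j is a multiple of s_1.
mu = [cmath.exp(2j * math.pi * j / n) for j in range(n)]
eta = [mu[j] - mu[(j - 1) % n] for j in range(n)]
t = [0.3, -0.1 + 0.2j, 0.05j, 0.2 - 0.1j, -0.15]      # arbitrary test vector
s1 = inner(fourier_vector(n, 1), t)
lhs = sum(tj * ej for tj, ej in zip(t, eta))
rhs = math.sqrt(n) * (1 - cmath.exp(-2j * math.pi / n)) * s1
```

The residuals are at rounding level, so each g_k is an eigenvector of B, and the last two quantities agree, which is the identity that reduces the closure constraint to s_1 = 0.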

3 Continuous deformable template models - thin plate splines

Another way to model the transformation between the
template and the realised outline is to fit a deformation
of $\mathbb{C}$, that is a continuous transformation $z \mapsto w(z)$, from
$\mathbb{C}$ to $\mathbb{C}$. The most common such model is the thin-plate
spline (Bookstein, 1989). The purpose of this section is
to explore some of the algebraic aspects of this method.

For this section let $z_j = x_j + i y_j$, $j = 1, \ldots, n$ denote
the landmarks in the template (denoted by $\mu_j$ before)
and $w_j = u_j + i v_j$, the transformed landmarks (denoted
by $z_j$ before). The output of the thin-plate spline algorithm
is a function $w : \mathbb{C} \to \mathbb{C}$, $w(z) = u(z) + i v(z)$,
satisfying $w(z_j) = w_j$, $j = 1, \ldots, n$.

One way to calculate the thin-plate spline is through
kriging, which we now briefly describe. The functions
$u(z)$ and $v(z)$ are fitted separately as follows. Consider
the function

$$ \sigma(z) = |z|^2 \log |z|^2 + c_1 + c_2 x^2 + c_3 y^2 + c_4 xy \qquad (3.1) $$

where $z = x + iy$ and $c_1, c_2, c_3, c_4$ are arbitrary constants.
(We shall see below that the choice of $c_1, c_2, c_3, c_4$ has no
effect on the final solution).

Let $z_0 = x_0 + i y_0$ be a new point at which we wish
to define $u(z_0)$. (In this section subscripts are not to be
interpreted mod n; $z_0$ should not be identified with $z_n$.)
The kriging approach says to take $u(z_0) = \alpha' u$ where $\alpha$
(depending on $z_0$) is chosen to minimise

$$ \beta' A \beta \quad \text{subject to} \quad S\beta = 0 \qquad (3.2) $$

where $\beta = (-1, \alpha')'$ is an $(n+1)$-vector, $A$ is an
$(n+1) \times (n+1)$ matrix with entries

$$ a_{jk} = \sigma(z_j - z_k) \qquad (3.3) $$

and $S$ is a $3 \times (n+1)$ matrix,

$$ S = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_n \\ y_0 & y_1 & \cdots & y_n \end{pmatrix}. \qquad (3.4) $$

The matrix $A$ is conditionally positive definite; that is
$\beta' A \beta > 0$ if $\beta \neq 0$ and $S\beta = 0$ (Matheron, 1972). Further
it is easily checked that if $S\beta = 0$ then $\beta' A \beta$ does
not depend on the values of $c_1, c_2, c_3, c_4$ above.

The motivation for the criterion (3.2) comes from the
theory of first-order intrinsic random fields. There
exists a real-valued intrinsic random field $\{X(z) : z \in \mathbb{C}\}$
such that whenever $S\beta = 0$, the increment
$\beta'[X(z_0), \ldots, X(z_n)]'$ has mean 0 and variance $\beta' A \beta$.
Further, if $\beta = (-1, \alpha')'$, then

$$ \mathrm{var}\{ \beta'[X(z_0), \ldots, X(z_n)]' \} = E\{ X(z_0) - \alpha'[X(z_1), \ldots, X(z_n)]' \}^2 $$

represents the prediction mean squared error of the random
field at the new site $z_0$ in terms of a linear combination
of its values at the existing sites $z_1, \ldots, z_n$.


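Partition

$$ A = \begin{pmatrix} a_{00} & \xi' \\ \xi & \Sigma \end{pmatrix}, \qquad S = \begin{pmatrix} (1, x_0, y_0)' & T \end{pmatrix}, \qquad (3.5) $$

where the matrix $\Sigma$ depends only on the data $z_1, \ldots, z_n$,
but the vector $\xi$ ($n \times 1$) also depends on the coordinates
of the new point $z_0$.

With a suitable choice of $c_1, c_2, c_3, c_4$ we can ensure
that $\Sigma$ is positive definite. Then using Lagrange multipliers
it is straightforward to show that the choice of $\alpha$
minimising (3.2) is

$$ \alpha = \Sigma^{-1}\xi - \Sigma^{-1}T'(T\Sigma^{-1}T')^{-1}T\Sigma^{-1}\xi + \Sigma^{-1}T'(T\Sigma^{-1}T')^{-1}[1, x_0, y_0]' \qquad (3.6) $$

so that

$$ u(z_0) = u'B\xi + \gamma'\,(1, x_0, y_0)', \qquad (3.7) $$

say. Here $\gamma = (T\Sigma^{-1}T')^{-1}T\Sigma^{-1}u$ is the generalised
least squares regression coefficient of $u_j$ on $(1, x_j, y_j)$,
$j = 1, \ldots, n$ and $\gamma'(1, x_0, y_0)'$ is the generalised least squares
predictor at the new point $z_0$. It can also be checked that
the value of $\gamma$ does not depend on $c_1, c_2, c_3, c_4$ above.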

Also, in (3.7),

$$B = \Sigma^{-1} - \Sigma^{-1}T'(T\Sigma^{-1}T')^{-1}T\Sigma^{-1}. \tag{3.8}$$

If we let $P = T'(TT')^{-1}T$ denote the orthogonal projection
matrix in $\mathbb{R}^n$ onto the columns of $T'$ (so $PT' = T'$),
then it is not difficult to show that

$$B = [(I - P)\Sigma(I - P)]^{-} \tag{3.9}$$

where $[\,\cdot\,]^{-}$ denotes the Moore-Penrose generalized inverse.
Further $B^{-} = (I - P)\Sigma(I - P)$. Note that the
eigenvectors of $B^{-}$ (corresponding to the non-zero eigenvalues)
are all orthogonal to the columns of $T'$. Hence
the matrix $B^{-}$ (and therefore $B$) does not depend on
the arbitrary choice of $c_1, c_2, c_3, c_4$.

The quantity $u'Bu$ is identified (see, for example,
Wahba, 1990, p. 33) as being proportional to the bending
energy of the transformation $z \mapsto u(z)$. It is also
easily checked that $B = B\Sigma B$ and $B = B(I - P)$. The
thin-plate spline for $v(z)$ proceeds similarly.
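
As a rough numerical illustration of (3.8), the following sketch builds a bending energy matrix for a handful of landmarks in plain Python. It is a sketch under stated assumptions rather than the construction used in the paper: the kernel is taken to be $|z|^2 \log|z|$ with the constants $c_1, \ldots, c_4$ set to zero, and $B$ is read off as the leading $n \times n$ block of the inverse of the bordered matrix $\begin{pmatrix} \Sigma & T' \\ T & 0 \end{pmatrix}$. This block-inverse identity is swapped in for (3.8) so that $\Sigma$ itself need not be positive definite. All function names are illustrative.

```python
import math

def kernel(p, q):
    """Assumed thin-plate kernel |z|^2 log|z| (constants c1..c4 taken as zero)."""
    r2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return 0.0 if r2 == 0.0 else 0.5 * r2 * math.log(r2)

def invert(m):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def bending_matrix(points):
    """Leading n x n block of the inverse of [[Sigma, T'], [T, 0]]."""
    n = len(points)
    t = [[1.0] * n, [p[0] for p in points], [p[1] for p in points]]  # T is 3 x n
    m = [[0.0] * (n + 3) for _ in range(n + 3)]
    for j in range(n):
        for k in range(n):
            m[j][k] = kernel(points[j], points[k])
        for r in range(3):
            m[j][n + r] = t[r][j]
            m[n + r][j] = t[r][j]
    inv = invert(m)
    return [row[:n] for row in inv[:n]]

def bending_energy(b, u):
    """Quadratic form u'Bu."""
    n = len(u)
    return sum(u[j] * b[j][k] * u[k] for j in range(n) for k in range(n))

landmarks = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.4)]
B = bending_matrix(landmarks)
affine = [2.0 + 3.0 * x - y for x, y in landmarks]  # affine in (x, y): no bending
bent = [0.0, 0.0, 0.0, 0.0, 1.0]                    # displace one landmark only
print(bending_energy(B, affine), bending_energy(B, bent))
```

The affine displacement should give an energy that is zero up to rounding, reflecting $B = B(I - P)$, while displacing a single landmark gives a strictly positive energy.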

Hence, given an underlying template of landmarks
$z_1, \ldots, z_n$, it is natural to model the deformed landmarks
$w_1, \ldots, w_n$ ($w_j = u_j + i v_j$) using a complex normal
distribution based on the bending energy,

$$P(\{w_j\}) \propto \exp\{-\tfrac{1}{2\sigma^2}[u'Bu + v'Bv]\}$$

or,

$$P(\{w_j\}) \propto \exp\{-\tfrac{1}{2\sigma^2} w^{*}Bw\} \tag{3.10}$$

where $\sigma^2$ is a scale parameter. This density is improper
since $B$ has rank $n - 3$ and further all linear transformations
of $w_1, \ldots, w_n$ have the same density.

It would be interesting to apply this model in the analysis
of images. Other than the scale constant it contains
no parameters to choose, once $z_1, \ldots, z_n$ are given.
One choice of $z_1, \ldots, z_n$ is to take these as vertices of
a regular polygon. Then $B$ simplifies somewhat since
$\Sigma$ is circulant Toeplitz as in the model of section 2.3
above. Further, a similar construction can be carried out
in dimensions other than 2.

4  Hand  Reconstruction 

We  now  consider  an  example  of  shape  reconstruction  for 
the  human  hand  using  the  model  (1.1)  for  observation 
noise  and  the  edge  model  of  section  2.2  for  the  prior 
distribution  of  the  shape.  The  hand  in  the  image  is  a 
real  human  hand  and  the  template  is  formed  from  the 
average  of  8  real  hands.  The  data  were  provided  by  Dan 
Keenan. Our example is motivated by Chow et al. (1988).

The shape model of section 2.2 contains two parameters
$\delta \in \mathbb{C}$ and $\beta > 0$. In our experiments we have limited
consideration to $\delta$ real. We have reparameterised $\delta$ and
$\beta$ in terms of $\lambda$ and $\sigma^2$ where

$$\lambda = \{\beta - (\beta^2 - 4\delta^2)^{1/2}\}/(2\delta), \qquad \sigma^2 = (\beta^2 - 4\delta^2)^{-1/2}, \tag{4.1}$$

because they have more intuitive interpretations as the
usual first-order autocorrelation and the marginal variance
respectively, in the analogous discrete-time AR(1)
process from time series.
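
The reparameterisation (4.1) is easy to check numerically. The sketch below assumes $\delta$ real and $\beta > 2|\delta|$. The inverse map is not given in the text; it is obtained here by solving (4.1) for $\delta$ and $\beta$, and the function names are illustrative.

```python
import math

def to_lambda_sigma2(delta, beta):
    """Map (delta, beta) to (lambda, sigma^2) as in (4.1); needs beta > 2|delta|."""
    if beta <= 2 * abs(delta):
        raise ValueError("need beta > 2|delta|")
    root = math.sqrt(beta ** 2 - 4 * delta ** 2)
    lam = 0.0 if delta == 0 else (beta - root) / (2 * delta)
    return lam, 1.0 / root

def to_delta_beta(lam, sigma2):
    """Inverse map derived from (4.1); needs |lambda| < 1 and sigma^2 > 0."""
    if not -1 < lam < 1 or sigma2 <= 0:
        raise ValueError("need |lambda| < 1 and sigma2 > 0")
    denom = sigma2 * (1 - lam ** 2)
    return lam / denom, (1 + lam ** 2) / denom

# Values of the kind quoted for Figure 1: lambda = 0.5, sigma = 0.2.
delta, beta = to_delta_beta(0.5, 0.2 ** 2)
print(delta, beta, to_lambda_sigma2(delta, beta))
```

Working in $(\lambda, \sigma^2)$ keeps the constraint simple ($|\lambda| < 1$, $\sigma^2 > 0$), whereas in $(\delta, \beta)$ the admissible region is the cone $\beta > 2|\delta|$.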

Our  reconstruction  procedure  can  be  conveniently 
split  into  3  stages. 

1. First we want to find the appropriate location and
orientation of the hand in the image, using a variant
of thresholding. Our approach has been to use the
alternating mean thresholding and median filtering
(AMT-MF) approach of Mardia and Hainsworth
(1988) to obtain a binary image. Setting $y_\ell = 1$ inside
the largest connected component and $y_\ell = 0$
elsewhere gives an initial reconstruction. Here $\ell$
labels the pixels of the image.

2. Given a similar binary image $\{x_\ell\}$ for the interior of
the template hand, and treating $\ell$ as a (2 × 1) column
vector, construct an affine map $A\ell + b$ so that the
first two moments of $\{A\ell + b : x_\ell = 1\}$ match the
first two moments of $\{\ell : y_\ell = 1\}$. Third moments
are used to resolve any orientation ambiguities.
Further, small rotations of the template are examined
to improve the fit, using the matching coefficient

$$\phi = \sum_\ell x_\ell y_\ell \Big/ \Bigl\{\sum_\ell (x_\ell + y_\ell - x_\ell y_\ell)\Bigr\}. \tag{4.2}$$

3. We now make use of the shape model of section 2.2
in an approximate ICM algorithm (see Besag, 1986).
We cycle through the vertices $z_j$ one at a time and,
using a grid search, consider updates of $z_j$ to maximise
the posterior density (1.3). These cycles are
iterated until convergence. In our example 4 cycles
usually sufficed, reducing the grid size from 9 × 9 pixels
down to 3 × 3 pixels as we progressed through
the cycles.
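
The matching coefficient (4.2) used in Stage 2 is the ratio of the overlap of two binary images to their union. A minimal sketch, assuming both images are stored as equal-sized nested lists of 0/1 values (the function name and toy images are illustrative):

```python
def matching_coefficient(x, y):
    """phi of (4.2): sum of x*y divided by sum of (x + y - x*y)."""
    both = either = 0
    for row_x, row_y in zip(x, y):
        for xl, yl in zip(row_x, row_y):
            both += xl * yl
            either += xl + yl - xl * yl
    if either == 0:
        raise ValueError("both images are empty")
    return both / either

template = [[0, 1, 1, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0]]
observed = [[0, 0, 1, 1],
            [0, 1, 1, 0],
            [0, 0, 0, 0]]
print(matching_coefficient(template, observed))  # 3 shared pixels out of 5 in the union
```

The coefficient equals 1 only when the two binary images coincide, so comparing it across small trial rotations of the template gives a simple criterion for the fit.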

Several features in our reconstruction algorithm are
worth emphasising.

(a) The initial reconstruction (Stages 1 and 2) has a
very important effect on the quality of the final
reconstruction.

(b) The number and location of the vertices are
important. At the very least, to represent a hand we
require the tips of the fingers and the lowest points between
them and points at the wrist. However, our experiments
indicate that these alone are not nearly enough, and
typically we take a template with 51 vertices as in Figure
1(a). In total there are 256 points on the template, and
the intermediate points are updated by interpolation. It
is also important not to have too many vertices.
Because our updating algorithm changes only one vertex
at a time, it can get stuck in an unsuitable local
optimum. Simulated annealing offers another way to cope
with this difficulty, but at increased computational cost.

(c) We have also imposed a ‘hard-core’ restriction to
prevent vertices getting too close together, e.g. bunching up
near the tips of the fingers. In our experiments a
minimum distance of 3 pixels between vertices was found
useful. Bunching is generally a problem only when the
noise is high.
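
The one-vertex-at-a-time grid search of Stage 3, together with the hard-core restriction, can be sketched as follows. This is only an illustration: the posterior density (1.3) is replaced by a hypothetical toy function `toy_log_post`, the intermediate grid sizes (7 × 7 and 5 × 5) are an assumption, and the interpolation of intermediate template points is omitted.

```python
def icm_cycle(vertices, log_post, half_width, min_dist=3.0):
    """One cycle: update each vertex in turn by a grid search over a
    (2*half_width + 1)^2 block of pixel offsets, keeping the best value."""
    verts = list(vertices)
    for j in range(len(verts)):
        x0, y0 = verts[j]
        best, best_val = verts[j], log_post(verts)
        for dx in range(-half_width, half_width + 1):
            for dy in range(-half_width, half_width + 1):
                cand = (x0 + dx, y0 + dy)
                # hard-core restriction: skip candidates too close to another vertex
                too_close = any(
                    k != j and (cand[0] - v[0]) ** 2 + (cand[1] - v[1]) ** 2 < min_dist ** 2
                    for k, v in enumerate(verts))
                if too_close:
                    continue
                trial = verts[:j] + [cand] + verts[j + 1:]
                val = log_post(trial)
                if val > best_val:
                    best, best_val = cand, val
        verts[j] = best
    return verts

def toy_log_post(verts, targets=((10, 10), (20, 10), (20, 20))):
    """Hypothetical stand-in for the log of (1.3): closeness to fixed targets."""
    return -sum((v[0] - t[0]) ** 2 + (v[1] - t[1]) ** 2 for v, t in zip(verts, targets))

verts = [(7, 12), (23, 8), (18, 23)]
for half_width in (4, 3, 2, 1):  # grids of 9x9, 7x7, 5x5 and 3x3 pixels
    verts = icm_cycle(verts, toy_log_post, half_width)
print(verts)
```

Because each update can only increase the objective with the other vertices held fixed, the procedure climbs to a local optimum, which is why the starting configuration from Stages 1 and 2 matters so much.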

Figure 1 shows the results of our algorithm on a 256 ×
256 image. In (b) we have the true image, to which
$N(0, \sigma^2)$ noise, $\sigma^2 = 4$, has been added to give (c). Here
$\nu_1 = 1$, $\nu_2 = 0$. Naive thresholding at $(\nu_1 + \nu_2)/2$ would
give an error rate of 40%. In (d) we have the effect of
applying AMT-MF and (e) gives its largest components.
In (f) we have the final reconstruction after 4 iterations
of Stage 3, with parameters $\lambda = 0.5$, $\sigma = 0.2$. The pixel
by pixel error rate is under 2%.
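
The error rate quoted for naive thresholding can be checked directly. With two grey levels $\nu_1$ and $\nu_2$, additive Gaussian noise and a threshold at the midpoint, each pixel is misclassified with probability $\Phi(-|\nu_1 - \nu_2|/(2s))$, where $s$ is the noise standard deviation. The sketch below assumes the noise level quoted for Figure 1 is a variance of 4, that is $s = 2$; the function names are illustrative.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def naive_threshold_error(nu1, nu2, noise_sd):
    """Per-pixel misclassification probability when thresholding at (nu1 + nu2) / 2."""
    return normal_cdf(-abs(nu1 - nu2) / (2.0 * noise_sd))

print(round(naive_threshold_error(1.0, 0.0, 2.0), 3))  # about 0.40
```

Under this reading the computed rate agrees with the 40% quoted above.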

Stage 1 does not assume any knowledge of $\nu_1$ and $\nu_2$.
For comparison we tried Stage 3 assuming $\nu_1$ and $\nu_2$
known (yielding a matching coefficient of $\phi = 0.83$ and
displayed in Figure 1(f)) and with $\nu_1$ and $\nu_2$ estimated
from Stage 1 and used in Stage 3 (yielding $\phi = 0.81$).
Thus knowing $\nu_1$ and $\nu_2$ leads to only a slight improvement
in the reconstruction. The ratio is treated
as known and has been chosen by trial and error. Figure
2 shows our initial global matching after Stage 2.

5  Other  Work 

For a review of other methods of deformable templates,
see Lipson et al. (1990). Amit et al. (1991) describe a
pixel-based approach to fitting a deformation of $\mathbb{C}$. A
description of the difficulties involved in three-dimensional
problems is given by Grenander and Keenan (1989).
Models for curvilinear shapes such as letters of the
alphabet are discussed by Manbeck et al. (1991). They
use a second-order SAR, but note that the CAR model
in (2.6) and (2.7), this time with boundary conditions,
again provides a useful model.

Face  recognition  is  an  interesting  application  area  for 
shape identification (cf. Bruce, 1988; Craw & Tock, 1991).
Here  there  are  nesting  constraints.  For  example,  pupils 
are  nested  within  eyes,  teeth  within  lips,  eyes  and  lips 
within  the  head,  etc. 

To  sum  up,  the  area  of  shape  reconstruction  poses 
many  interesting  statistical  problems. 

In this brief Discussion I would like to weave
together the key words and phrases from the title
of  this  session:  “multivariate  statistics,”  “labelled 
point  data,”  and  “visualization.”  Their  interplay 
underlies  a  useful  duality  of  approaches  to  many 
problems  of  data  analysis  in  a  domain  of  growing 
practical  importance:  statistics  of  image  data. 

Data  from  images 

The  problems  I  have  in  mind  originate  in 
data  that  are  already  visualized.  The  source  of 
information  might  be  a  medical  image,  perhaps, 
indicating  a  physical  property  (some  sort  of 
interaction  with  radiation)  within  each  of  a  grid  of 
little  volumes  inside  a  region  of  tissue.  Or  it  might 
be  a  geological  survey,  showing  physical  or 
electromagnetic  properties  near  each  of  a  vastly 
sparser  grid  of  points  on  the  surface  of  the  earth 
(or  in  a  solid  chunk  of  its  interior);  or  a  weather 
map, indicating physical properties of the air
(composition, temperature, suspensates, velocity)
near a still sparser grid of points in the three-
dimensional atmosphere. In a great variety of
scientific  contexts,  our  concern  is  to  investigate 
aspects  of  this  sort  of  data  many  sets  at  a  time:  a 
heap  of  synoptic  weather  maps,  a  span  of  eons  of 
continental  drift,  or  a  sample  of  biological  or 
biomedical  histories.  That  is,  we  wish  a  synthetic 
image  concentrating  certain  features  of  particular 
interest  drawn  from  the  context  of  rather  dilute 
information  that  is  each  original  brain  scan,  or 
survey,  or  weather  map.  To  emphasize  this  task  of 
comparison  rather  than  the  pursuit  of  arbitrary 
detail  is  to  ask  a  different  sort  of  question  than  that 
the  original  visualization  was  designed  to  answer. 
The  goal  now  is  to  retrieve  not  what  is  unique  to 
each  instance  but  what  is  common  to  all,  what  is 
most  variable  among  them,  what  typically  covaries 
with exogenous causes or effects, etc. Tools are
needed for carrying out some steps in this search in
a  manner  requiring  the  least  attention,  the  least 
interaction,  or  the  greatest  degree  of  automation. 

Let  us  be  a  bit  more  formal.  The  data  under 
discussion  are,  in  general,  vectors  at  each  point  of  a 
domain  organized  on  Euclidean  principles— 
multicolored  pixels  or  voxels  or  air  samples  out  in 
the  woods.  Think  of  these  as  a  spectrum  of  scalars 
on  a  surface  drawn  or  interpolated  “above”  each 
point  of  the  domain.  Such  pictures  can  be  very 
beautiful,  but  we  shall  ignore  that  distraction.  We 
pursue  the  alternative  visualizations  to  arrive  at 
scientific  understanding,  not  necessarily  further 
pretty  pictures.  The  Cartesian  product  of  picture 
plane  or  grid  or  space  by  the  length  of  the  vector  of 
observation  is  occasionally  augmented  by  another 
Cartesian  factor  corresponding  to  replication  (e.g., 
over  time);  ordinarily,  but  not  always,  this  factor 
can  be  folded  into  the  length  of  the  vector  content. 

Vertical and horizontal

A  conventional  multivariate  analysis  will 
usually  deal  with  aspects  of  this  surface  in  a 
manner  that  ignores  the  prior  geometric  ordering. 
The  usual  multivariate  formalism  proffers  only  one 
quadratic  form,  representing  the  variance- 
covariance  matrix;  perhaps  there  is  also  a  custom- 
designed  “error  covariance  matrix”  incorporating 
information  about  adjacencies.  Otherwise  the 
geometric  origin  of  the  index  set  of  variables  is 
nowhere  in  evidence.  Let  us  call  that  kind  of 
statistics  “vertical.”  There  is  always  additional 
information,  then,  in  the  horizontal  part  of  this 
imagined  figure— the  information  about  where  the 
labelled  locations  of  the  ground  plane  actually  lie, 
and  how  their  locations  covary  with  height(s)  of  the 
surface(s) above them. This horizontal part is what
we  primates  are  used  to  processing.  It  is  borne  on 
the  nonlinear  world  of  the  retina,  rich  in  alternate 
visualizations of pattern and color, of depth and of
motion,  of  prey  and  of  predators,  of  swinging  from 
limb  to  limb  or  leg  to  leg.  The  conventional 
multivariate  approach  ignores  the  evolutionary 
history  of  the  organism  that  has  brought  forth 
statisticians.  The  single  visualization  of  abstract 
linear  vector  spaces  hardly  deserves  the  name  of 
“visualization”  at  all.  Whenever  data  is  originally 
visual,  and  especially  if  it  is  originally  gridded,  the 
linear  machinery  must  be  supplemented,  if  not 
wholly  supplanted,  by  a  semantics  of  pushes  and 
pulls,  of  motion  and  deformation. 

Matters  are  easiest  if  we  restrict  ourselves, 
in  accord  with  the  title  of  the  present  session,  to 
labelled  points  in  this  horizontal  dimension,  points 
that  have  specific  names  corresponding  from 
instance  to  instance  of  the  image.  Note  that  it  is 
the  points  of  the  ground  plane  (or  ground  space,  in 
the  3-d  applications)  that  are  labelled,  not  the 
points  of  the  imaginary  “surfaces”  of  data  floating 
above  them.  These  labels  can  be  fixed  n-uples,  like 
map  coordinates,  or  they  can  move  about  on  a 
subordinate  map  of  their  own,  like  the  bridge  of 
your  nose  as  it  locates  your  eyeglasses  on  your 
profile.  Labelled  points,  it  turns  out,  support  a 
feature  space  quite  a  bit  more  promising  than  that 
which  is  accessible  in  the  general  run  of 
multivariate  problems.  For  instance,  sets  of 
labelled  points  have  their  own  metrics,  usually 
hugely  transmogrified  versions  of  ordinary 
interpoint  distances  or  Cartesian  coordinates,  that 
complement  the  usual  covariance-based  metric  of 
multivariate  observations.  The  labelled  points  can 
move  about  in  their  Euclidean  domain  at  the  same 
time  that  images  change  above  them,  leading  to 
decompositions  of  the  variance  “at”  a  point  that  are 
very  interesting  both  scientifically  and  statistically. 
One can ask, for instance, whether regions of the
cardiac  wall  that  show  abnormal  changes  of 
curvature,  as  indicated  by  the  relative  motion  of 
the  arterial  bifurcations  nearby,  are  the  same 
sectors  as  those  showing  anomalies  of  texture  in  an 
ultrasound  scan. 

The  three  lectures  in  this  session  all  deal 
with  the  relations  between  “vertical”  and 
“horizontal”  aspects  of  this  sort  of  data,  relations 
very  conveniently  filtered  through  the  low 
dimensionality  and  immense  graphical  power  of 
labelled  points.  In  the  first  lecture,  Paul  Sampson 
and  colleagues  show  a  visualization  of  the  relation 
between  the  two  distances,  horizontal  and  vertical 
(geometrical and statistical), for the same set of
labelled points which, being geographical sites, are
in  fact  invariable  in  position.  This  relation  between 
two  distances  is  visualized  as  if  it  were  a 
deformation  of  the  map,  the  “rubber  sheet”  that  is 
our  most  familiar  imagery  for  changing  distance 
measures.  In  this  way  Sampson  reduces  “vertical” 
covariation to “horizontal” visualization. In this
synthetic horizontal structure, the principal
components that would be lengthy vectors of
coefficients point by point in the “vertical” analysis
become, instead, extended curves at 90°, the
biorthogonal grid of the horizontal analysis. That
is,  we  have  turned  the  relation  between  a  vertical 
covariance  matrix  and  a  horizontal  geometric 
distance  matrix  into  a  single  visual  entity,  a 
horizontal  symmetric  tensor  field,  graphed  using  a 
pair  of  directions  at  every  point. 

The  other  two  lectures,  Lange’s  and 
Mardia’s,  may  be  thought  of  as  treating  strategies 
for  understanding  the  interplay  of  “vertical”  and 
“horizontal”  analyses  of  the  same  “three- 
dimensional”  topographic  data.  For  instance,  a 
horizontal  analysis  may  be  best  if  one  wants  to  use 
the  geometry  of  the  labelled  image  rather  as  one 
uses  a  covariate  in  a  classic  experimental  design. 

In this case it is as if the shape of the configuration
of labelled points (the very basis of the vector
space underlying the data) is effectively nuisance
variation. “Controlling” this variation increases the
precision with which other effects can be addressed.
That is, one analyzes vertically (examining the
gradients of the picture, for instance, or its
correlations with physical or biological processes)
after “unwarping” horizontally to a less blurry
feature space in which processes more nearly “stay
put” to have their averaged picture taken
(Bookstein, 1991b). In multivariate language, we
are  projecting  out  a  complicated  nonlinear  feature 
co-space.  The  experience  of  generations  of 
anatomists  has  shown  that  this  maneuver  improves 
the  power  of  subsequent  multivariate  tactics,  such 
as  discrimination  or  analysis  of  covariance.  When 
averaging  pictures  of  brain  activity  over  brains  of 
different  shape,  for  instance,  the  landmarks  serve 
as  guides  to  the  correspondence  of  regions  prior  to 
averaging.  It  is  the  landmarks,  not  the  squares  of 
the  grid  of  a  PET  reconstruction,  that  represent  the 
true  coordinate  system  for  valid  biometric  analyses. 

Horizontal  analyses 

In  other  applications,  this  “horizontal” 
variation is not noise or nuisance, but itself a signal
in  its  own  right.  The  labelled  points  support  a  very 
powerful  low-dimensional  feature  space,  in  effect  a 
tangent  space  to  Kendall’s  shape  space  (Bookstein, 
1991;  Goodall,  1991)  in  the  vicinity  of  the  mean 
configuration  of  labelled  points.  With  the  aid  of  a 
convenient  basis  for  the  elements  of  this  shape 
space,  this  information  may  be  concentrated  into 
linear  features  of  its  own.  When  the  variables  of 
this  block  are  paired  with  less  delicately  crafted 
descriptors  of  the  original  “vertical”  scalar  or  vector 
content,  there  results  a  sort  of  hierarchical 
multiple-regression  approach  for  prediction  of  other 
images,  such  as  later  images  of  the  same  system, 
and  for  the  joint  evocation  of  shape  and  content  as 
a  bispectral  signal  in  a  detection  or  classification 
problem,  such  as  locating  tumors  or  quantifying 
their  recession  under  treatment.  For  brain  shape, 
for  instance,  the  statistics  of  this  “horizontal”  space 
suggest  some  unwarpings  that  might  be  unusually 
effective  at  unblurring  the  subsequent  vertical 
average (Bookstein, 1991b).

It  turns  out  that  visualization  of  vectors  in 
the  feature  space  of  labelled  point  shape— that  is, 
shape  changes  in  sets  of  landmarks— is  at  least  as 
easy  as  visualization  of  changes  in  surfaces  above 
the  planes  or  volumes  tagged  by  those  points.  The 
best  visualizations  are  suggestive  of  the  process 
explanations,  the  “bulges”  and  “shears”  and 
“warps,”  that  are  automatically  familiar  to  any 
sentient  organism  that  ever  navigated  a  binocular 
landscape.  The  combination  of  features  of  labelled 
point  shape  with  features  of  the  image  “at  the 
average  shape”— the  careful  separation  of  vertical 
from  horizontal  variation  in  these  mixed  feature 
spaces,  and  the  careful,  specialized  visualization  of 
the  horizontal— is,  in  my  view,  the  most  powerful 
generator  available  for  good  analyses  of  biometrical 
images. 

Mixed  analyses 

The  freedom  to  combine  a  geometric  metric 
with  the  customary  statistical  one  is  unfamiliar  to 
most  applied  statisticians.  An  analogy  from 
physical  science  may  be  useful:  this  is  precisely  the 
same  freedom  as  is  granted  us  by  Newtonian 
mechanics— the  existence  of  absolute  space,  and 
absolute  time,  and  hence  an  absolute  scale  of 
relative  velocities  in  meters  per  second,  independent 
of  all  the  other  quantitative  laws  of  physics.  (It  is 
this decoupling that is contravened by Einstein’s
special theory of relativity; but Einstein was not a
statistician in the sense we are using that word
here.) Geometrically, this construction is called a
“Galilean metric.” Space is measured in
centimeters,  and  time  in  seconds,  and  there  is  no 
absolute  constant  of  conversion  between  them— no 
“speed-of-light”— but  only  diverse  objects  and  their 
velocities,  each  of  which  is  an  empirical  matter.  In 
the  analogous  context  of  images  over  labelled  point 
data,  there  is  a  collection  of  distance  measures  for 
landmark  configurations  and  another  collection  of 
distance  measures  for  multivariate  distributions; 
and  the  relation  of  these  two  sets  of  measures  is 
purely  an  empirical  matter,  as  encoded,  for 
instance,  in  a  singular-value  decomposition.  The 
peculiar  advantage  of  labelled  point  data  is  the 
unexpected  simplicity  of  its  own  statistical 
structure.  Many  transformations  that  appear 
hopelessly  nonlinear  in  terms  of  the  multivariate 
space  oriented  “vertically”  turn  out  to  be  linear,  or 
nearly  so,  in  aspects  of  the  same  space  viewed  and 
measured  horizontally.  There  are,  then,  many 
more  practicable  and  interesting  directions  of 
projection  of  these  composite  spaces  than  would  be 
available  in  an  ordinary  problem  having  the  same 
net  count  of  degrees  of  freedom. 

Let  us  consider,  for  example,  the  difference 
between  two  kinds  of  analysis  of  relations  among 
pictures:  “motion”  and  “deformation.”  For  the  case 
of  “motion,”  consider  a  one-dimensional  picture, 
pixels  in  a  line.  Our  task  is  to  detect  an  extended 
point  moving  uniformly  along  this  grid.  Under  the 
(physically reasonable) assumption of linear
motion, any such detection is a linear projection
(the averaging of values $p_{i+vt}$) in a direction of the
Cartesian product space (image pixels by
replications) taking account of the speed $v$ of the
motion. Hence motion of a point can be detected by
a one-dimensional suite (varying $v$) of linear
operators applied to a higher-dimensional
representation.
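
A minimal sketch of this shift-and-average detector, assuming the replications are stored as a list of equal-length pixel rows, that speeds are whole pixels per replication, and that tracks leaving the grid are ignored (all of these are illustrative choices, as are the names):

```python
def detect_motion(frames, speeds):
    """Return (score, start pixel, speed) maximising the average of
    frames[t][i + v*t] over t; each score is a linear function of the data."""
    n, reps = len(frames[0]), len(frames)
    best = None
    for v in speeds:
        for i in range(n):
            track = [i + v * t for t in range(reps)]
            if min(track) < 0 or max(track) >= n:
                continue  # the track leaves the grid
            score = sum(frames[t][k] for t, k in enumerate(track)) / reps
            if best is None or score > best[0]:
                best = (score, i, v)
    return best

# A single bright pixel starting at position 2 and moving one pixel per frame.
frames = [[1.0 if k == 2 + t else 0.0 for k in range(10)] for t in range(5)]
print(detect_motion(frames, speeds=[-2, -1, 0, 1, 2]))
```

For each fixed speed the score is a fixed linear combination of the pixel values, so the whole search is a one-parameter family of linear operators followed by a maximisation.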

But  consider,  now,  the  problem  of  detecting 
deformation,  like  the  reflection  of  your  face  in  a 
flawed  mirror,  or  the  growth  of  your  child’s  face 
over  time.  The  corresponding  transformations  of 
feature  space  are  not  linear  in  the  extent  of 
deformation.  As  landmarks  move  over  distances  at 
greater  than  subpixel  scale,  the  linearity  of 
geometry  in  the  underlying  ground  plane  is 
converted  into  sharp  turns  in  the  linear  space  of 
vectors  over  pixels  in  which  the  conventional 
multivariate  statistics  is  mounted.  The  “same” 
tissue signal lies over weighted averages of pixels
$(p_i, p_{i+1})$, then $(p_{i+1}, p_{i+2})$, etc. Each of these
segments makes an angle of 135° with its
predecessor and successor, and angles of 90° with
all the other such segments in the linear space of
pixel-by-pixel content. Then as landmarks move
over pixels, the resulting series of transformations
is far from a linear extension: in multivariate
space,  it  makes  a  wrenching  turn  every  time  a 
pixel  boundary  is  crossed.  Smoothing  algorithms 
can  lower  the  variance  associated  with  these  turns, 
but  they  cannot  evade  the  underlying  geometric 
infelicity.  Yet  the  averaging  of  biological  images  is 
made  vastly  more  powerful  when  these  nonlinear 
transformations  are  executed  first  (Bookstein, 
1991b). In practice, these techniques combine
among  themselves  for  analyses  of  motion  over  a 
deforming  scene,  such  as  when  the  flexing  of  a  joint 
deforms  surrounding  tissues,  or  in  the  contraction 
of  a  heart  that  is  bouncing  on  its  tether  within  the 
chest  wall. 

The  variety  of  shape  metrics 

The  useful  metrics  for  shape  itself  are 
perhaps  unfamiliar.  They  include  the  Procrustes 
metric  of  minimal  rms  Euclidean  distance  (Goodall, 
1991),  the  deficient  metric  of  localized  shape 
difference  restricted  to  the  space  of  residuals  from 
affine  transformations  (Bookstein,  1991),  the 
hyperbolic log-anisotropy metric for uniform shears
(ibid.),  and  others  mixing  shape  information  with 
information  about  size.  The  available  composites 
for  combining  information  from  the  labelled  points 
with  information  from  the  “surface”  of  the  image 
are  then  far  more  intricate  than  those  of 
Hamiltonian  mechanics,  with  its  geometrization  of 
Newton’s  laws.  The  composite  metrics  apposite  to 
the  understanding  of  sets  of  images  can  incorporate 
correlations  of  “height”  and  its  spatial  derivatives 
with  shape  and  its  alternate  metrics  in  endless 
combinations  (Bookstein,  1991a).  Consider,  for 
instance,  the  problem  of  detecting  growth  in  a  brain 
tumor.  This  is  the  correlation  of  one  visual  texture 
to  the  interior  of  a  disk  under  a  barrel  distortion, 
and  the  correlation  of  another  field  of  motion  to  the 
exterior  of  the  same  disk,  all  as  constrained  (with 
considerable  real  physical  nonlinearity!)  by  the 
bony  margin  of  the  braincase.  The  resulting 
“metric”  has  no  easy  illustration  other  than  the 
very  picture  of  the  mixed  analysis  we  would 
thereby be operationalizing: warping of the interior
and the exterior of a labelled disk (the tumor
“boundary”)  separately,  followed  by  vertical 
comparisons  of  tumor  texture,  dissections  of  the 
motion  of  arterial  bifurcations,  and  so  on. 
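
As one concrete instance of a shape metric, the sketch below computes a Procrustes-type distance between two planar landmark configurations, with landmarks written as complex numbers. It uses the full Procrustes distance $\{1 - |z^{*}w|^2\}^{1/2}$ between centred, unit-size configurations $z$ and $w$, which removes translation, rotation and scale. This is offered as a common textbook form, not necessarily the exact metric of the papers cited above, and the function names are illustrative.

```python
import math

def preshape(points):
    """Centre a planar configuration (complex numbers) and scale it to unit size."""
    centroid = sum(points) / len(points)
    centred = [p - centroid for p in points]
    size = math.sqrt(sum(abs(p) ** 2 for p in centred))
    if size == 0:
        raise ValueError("all landmarks coincide")
    return [p / size for p in centred]

def procrustes_distance(a, b):
    """Full Procrustes distance between two configurations of matched landmarks."""
    za, zb = preshape(a), preshape(b)
    inner = sum(p.conjugate() * q for p, q in zip(za, zb))
    return math.sqrt(max(0.0, 1.0 - abs(inner) ** 2))

triangle = [0 + 0j, 1 + 0j, 0 + 1j]
moved = [(2 + 1j) + 3j * p for p in triangle]  # rotated 90 degrees, scaled by 3, shifted
sheared = [0 + 0j, 1 + 0j, 0.5 + 1j]
print(procrustes_distance(triangle, moved), procrustes_distance(triangle, sheared))
```

The first distance is zero up to rounding, since the two configurations differ only by a similarity transformation; the second is positive because the shear changes the shape itself.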

To  this  diversity  of  metrics  corresponds  an 
equal  diversity  of  notions  of  orthogonal  projection. 
The number of different ways in which features can
be measured or, alternatively, projected out of
these  composite  spaces,  is  thus  fairly  rich.  One’s 
choice  depends  on  the  specific  sort  of  pattern  being 
sought  in  the  data  analysis,  which  is  to  say,  on  the 
process  governing  the  composite  image  (weather, 
Alzheimer’s  disease,  continental  drift).  We  can 
seek  to  describe  the  variation  (in  the  labelling 
plane)  of  the  location  of  a  “point”  feature  (vanishing 
of  a  derivative)  or  instead  the  location  of  an  “edge” 
(vanishing  of  a  second  derivative);  or  we  may 
attempt  instead  to  minimize  the  variation  of 
location  of  these  features  so  as  to  ease  the  study  of 
something  else  about  the  picture  (for  example,  the 
texture  of  ventricular  borders  in  Binswanger’s 
disease).  One  might  extend  the  correspondence  of 
labelled  points  to  curves  connecting  them;  there 
results  a  tessellation  of  the  plane  into  corresponding 
regions  suited  for  regional  averages,  coefficients  of 
variation,  etc.  We  can  study  wave-like  phenomena 
either  as  the  vertical  motion  of  vectors  at  fixed 
points  or  as  the  horizontal  motion  of  nodes  at 
extremal  points,  whichever  corresponds  better  to 
the  dynamics  of  the  underlying  morphogenetic 
process.  We  can  attempt  to  measure  the  deviation 
of  a  spatial  surface  in  between  landmarks,  in  order 
to  study  its  regional  fractal  dimension  or  other 
aspects  of  geometrical  texture;  or  we  can  attempt 
to  flatten  this  variation  of  height  onto  a  map  so  as 
to study autocorrelation of grey levels or thickness
of  surface  layers  in  the  true  Gaussian  surface 
metric.  Either  of  these  types  of  registration  may 
be  used  to  generate  sample  means  for  purely 
descriptive  purposes  or,  alternatively,  may  be 
turned  to  the  investigation  of  group  differences  or 
covariation  with  other  aspects  of  the  picture,  with 
causes,  or  with  effects.  We  can  correlate  values,  or 
gradients,  or  the  Hessian  of  the  scalar  load  over  a 
region  with  a  Cartesian  coordinate  or  with  a  tensor 
representing  some  aspect  of  the  relation  of  the 
labelling  configuration  to  the  mean,  such  as  the 
Jacobian  of  the  implied  deformation.  And  so  on, 
through many other possibilities, whether in one
dimension  or  in  a  higher  space. 

Concluding remark

In this brief compass I can do no more than hint
at  the  power  for  image  analysis  and  scientific 
insight  of  the  new  methods  that  exploit  labelled 
point  data  to  enrich  the  conventional  multivariate 
metaphor.  In  this  combination  of  real  (physical, 
binocular)  geometry  with  the  abstract  geometry  of 
linear  models  lies  the  key  to  most  problems  of 
pattern  detection  and  display  across  the  sciences 
that  begin  with  real  fields:  images  over  maps, 
bodies,  or  space,  in  one  instantiation  or  several,  in 
one  color  or  multispectrally,  at  one  instant  or  many 
in  linear  or  cyclic  time.  The  key  to  the  combination 
of  the  vertical  and  the  horizontal  metrics  is  their 
careful  separation  to  begin  with:  separation  of 
change  of  image  content  from  repositioning  of  the 
carrier  pixels.  The  separation  proceeds  best  with 
the aid of a unique intermediate statistical
structure, the nonstandard multivariate technology
of  shape  space  for  labelled  point  configurations.  In 
its  peculiar  finite-dimensional  elegance,  this  space 
affords  a  basis  for  linear  features  of  the  arbitrarily 
nonlinear  transformations  that  we  see,  and  explain, 
in  the  real  world.  Many  of  the  processes 
accounting  for  variation  in  these  images  can  be 
modeled  as  nearly  linear  in  these  transformations, 
and  many  other  questions  are  made  quite  a  bit  less 
murky  after  that  linearizable  part  of  image 
variation  is  partialled  out  of  the  image. 

Abstract

Several Markov chain-based methods are available for
sampling from a posterior distribution. Two important
examples are the Gibbs sampler and the Metropolis
algorithm. In addition, several strategies are available
for constructing hybrid algorithms. This paper outlines
some of the strategies that are available, and discusses
some theoretical and practical issues in the use of these
strategies. In addition, some preliminary efforts to use
Markov chains to control dynamic graphics for exploring
higher-dimensional posterior distributions are outlined.

1  Introduction 

Suppose we are given a posterior distribution $\pi$ on a
quantity $\theta$ with values in a space $E$. Usually $E$ will be
a subset of $\mathbb{R}^k$ and $\pi$ will have a density with respect to
a measure $\mu$,

$$\pi(dx) = \pi(x)\,\mu(dx).$$

For simplicity, $\pi$ will be used to denote both the distribution
and the density. We may be interested in computing
a particular numerical characteristic of $\pi$, or more generally
in developing an understanding of what information
$\pi$ contains about $\theta$.

Several methods for computing characteristics of posterior
distributions are now available. These include
asymptotic approximations, numerical integration, and
sampling or Monte Carlo methods. Sampling methods
for examining posterior distributions provide ways
of generating samples with the property that the empirical
distribution of the sample, or an appropriately
weighted empirical distribution, approximates the posterior
distribution. Using such samples, it is easy to estimate
characteristics such as the mean or standard deviation
of a function of $\theta$. Marginal distributions can be
estimated using smoothing or, in some cases, variance
reduction methods. In addition, for equally weighted
samples, methods for viewing point clouds, such as rotating
plots and Grand Tours, can be used to examine
the joint uncertainty about three or more components or
features of $\theta$.

A number of different sampling methods are available.
In rare cases it is possible to sample directly from the
posterior distribution and thus obtain an i.i.d. sample
from $\pi$. In most problems this is not possible. Either
the sample has to be dependent, or the distribution used
to generate the sample has to be different from $\pi$. A
method that uses independent samples from a distribution
similar to $\pi$ is importance sampling. The sample is
then weighted to make up for the difference between $\pi$
and the distribution used to generate the sample. Over
the past decade, most work on sampling methods for
exploring posterior distributions has centered on importance
sampling (Geweke, 1989; Stewart, 1979; van Dijk
et al., 1978; Zellner and Rossi, 1984; among others). An
alternative approach that avoids the need for weights is
to use a dependent sample, such as the sample path of
a Markov chain.
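
The weighting step in importance sampling can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `importance_estimate` is hypothetical, the self-normalized form of the estimate is one common choice rather than something prescribed above, and `log_pi` may omit the normalizing constant of $\pi$.

```python
import math

def importance_estimate(g, log_pi, log_f, draws):
    """Self-normalized importance sampling estimate of the mean of g under pi.

    draws are assumed to be independent samples from the density f
    (hypothetical interface; log_pi may be known only up to a constant).
    """
    log_w = [log_pi(x) - log_f(x) for x in draws]   # log of pi(x) / f(x)
    shift = max(log_w)                              # guard against overflow in exp
    w = [math.exp(lw - shift) for lw in log_w]
    return sum(wi * g(x) for wi, x in zip(w, draws)) / sum(w)
```

Because the weights are normalized by their sum, any constant factor in $\pi$ or in the sampling density cancels.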


2  Markov  Chain  Methods 

Markov chain methods generate a sample path from
a Markov chain that has $\pi$ as its stationary distribution.
Recent work of Gelfand and Smith (1990) on
the Gibbs sampling algorithm has renewed interest in
Markov chain methods for exploring posterior distributions.
Gelfand and Smith extend the Gibbs sampling
algorithm of Geman and Geman (1984), originally developed
for Bayesian image reconstruction, to continuous
distributions and show how the algorithm can be
used in a wide variety of problems.

Markov chain methods have a long history in mathematical
physics dating back to the algorithm of Metropolis
et al. (1953). The Metropolis algorithm is in fact a
general class of algorithms that includes versions of the
discrete Gibbs sampler as special cases.

2.1 The Metropolis Algorithm

Metropolis et al. (1953) originally proposed the algorithm
now known as the Metropolis algorithm as a
method of sampling from the equilibrium distribution
of an interacting particle system. The algorithm, which
is described in Hammersley and Handscomb (1964, Section 9.3)
and Ripley (1987, Section 4.7), was extended by
Hastings (1970) and explored further by Peskun (1973).

To define Hastings' version of the algorithm, let $Q$ be
a Markov transition kernel with

$$Q(x, dy) = q(x, y)\,\mu(dy).$$

Let $E^+ = \{x : \pi(x) > 0\}$, and assume $Q(x, E^+) = 1$ for
$x \notin E^+$. Then define

$$\alpha(x, y) = \min\left\{ \frac{\pi(y)\,q(y, x)}{\pi(x)\,q(x, y)},\ 1 \right\}$$

for $\pi(x)q(x, y) > 0$. Otherwise, $\alpha(x, y) = 1$. If the
Markov chain is currently at $X_n = x$, then the algorithm
generates a candidate $Y = y$ for the next state
from $Q(x, \cdot)$. With probability $\alpha(x, y)$ this candidate is
accepted and the chain moves to $X_{n+1} = y$. Otherwise,
the candidate is rejected and the chain remains at
$X_{n+1} = x$.

Since

$$\pi(x)\,q(x, y)\,\alpha(x, y) = \pi(y)\,q(y, x)\,\alpha(y, x),$$

a Metropolis chain with initial distribution $\pi$ is reversible.
Therefore $\pi$ is an invariant distribution for the
chain. Some additional conditions on $\pi$ and $Q$ are needed
to insure that $\pi$ is also a limiting distribution; these conditions
are discussed in Section 3 below. Since the acceptance
probability only depends on $\pi$ through the ratio
$\pi(y)/\pi(x)$, the density $\pi$ only needs to be specified up
to a constant of proportionality.

If $q(x, y) = q(y, x)$, i.e. $q$ is symmetric, then the acceptance
probability $\alpha(x, y)$ simplifies to

$$\alpha(x, y) = \min\left\{ \frac{\pi(y)}{\pi(x)},\ 1 \right\}.$$

This is the original form of the algorithm proposed by
Metropolis et al. (1953). Other forms of the rejection
probability are possible, but the form given here can
be shown to be optimal within a wide class of possible
alternative forms (Peskun, 1973).
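
The accept/reject step of Hastings' version can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `hastings_acceptance` and `hastings_chain` and the log-density interface (`log_pi`, `log_q`, `propose`) are assumptions made for the example, and densities are handled on the log scale to limit underflow.

```python
import math
import random

def hastings_acceptance(log_pi, log_q, x, y):
    """alpha(x, y) = min{pi(y) q(y, x) / (pi(x) q(x, y)), 1}, computed from logs."""
    log_den = log_pi(x) + log_q(x, y)
    if log_den == -math.inf:
        return 1.0                      # pi(x) q(x, y) = 0: alpha is defined as 1
    log_num = log_pi(y) + log_q(y, x)
    return math.exp(min(0.0, log_num - log_den))

def hastings_chain(log_pi, propose, log_q, x0, n_steps, rng):
    """Return the sample path X_1, ..., X_n of the chain started at x0."""
    x, path = x0, []
    for _ in range(n_steps):
        y = propose(x, rng)             # candidate Y = y drawn from Q(x, .)
        if rng.random() < hastings_acceptance(log_pi, log_q, x, y):
            x = y                       # accept: the chain moves to y
        path.append(x)                  # on rejection the chain stays at x
    return path
```

Since only the difference `log_pi(y) - log_pi(x)` enters the acceptance probability, `log_pi` needs to be known only up to an additive constant, which mirrors the remark about the constant of proportionality.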

The Metropolis algorithm is actually a class of algorithms.
Each different choice of the kernel $Q$ for generating
candidate steps produces a different version of
the algorithm. Several classes of kernels appear to be
particularly useful for examining posterior distributions.


2.1.1  Random  Walk  Chains 

For $E = \mathbb{R}^k$ and $f$ a density on $E$, set $Y = x + Z$, with
$Z$ drawn independently from $f$. Then

$$q(x, y) = f(y - x).$$

Thus the kernel $Q$ driving the Metropolis chain is a random
walk. Natural choices of $f$ are normal, uniform, and
$t$ distributions. Split-$t$ distributions (Geweke, 1989) may
also be useful. The scale matrix for $f$ can be taken as a
constant $c$ times the inverse information at the posterior
mode. Good choices for the step size constant $c$ are still
an open problem, but $c = 1$ and $c = 1/2$ seem to work
reasonably well in a number of examples.

If $f$ is symmetric about the origin, i.e. if $f(z) = f(-z)$,
then $q$ is symmetric and the simpler form of the
acceptance probability $\alpha(x, y)$ can be used.
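
A one-dimensional random walk chain with normal increments might look like the sketch below. The function name `random_walk_metropolis` and its parameters are hypothetical; `scale` is the standard deviation of the increment, so in one dimension it plays the role of $\sqrt{c}$ times the posterior standard deviation, and the starting point is assumed to have positive posterior density.

```python
import math
import random

def random_walk_metropolis(log_pi, x0, scale, n_steps, seed=0):
    """Random walk Metropolis chain on the real line with normal increments."""
    rng = random.Random(seed)
    x, log_px = x0, log_pi(x0)
    path, n_accept = [], 0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, scale)       # Y = x + Z, Z symmetric about 0
        log_py = log_pi(y)
        # symmetric q, so alpha(x, y) = min{pi(y) / pi(x), 1}
        if math.log(1.0 - rng.random()) <= log_py - log_px:
            x, log_px = y, log_py
            n_accept += 1
        path.append(x)
    return path, n_accept / n_steps
```

Returning the acceptance rate alongside the path makes it easy to compare different values of the step size constant on a given problem.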

2.1.2  Independence  Chains 

Suppose $f$ is a density on $E$, and we generate candidates
$y$ independently from the single density $f$. Then

$$q(x, y) = f(y).$$

The chain of candidates driving this Metropolis chain is
an i.i.d. sequence from the density $f$. The acceptance
probability for an independence chain can be written as

$$\alpha(x, y) = \min\left\{ \frac{w(y)}{w(x)},\ 1 \right\}$$

for $w(x) = \pi(x)/f(x)$. The function $w$ is the weight
function that would be used in importance sampling
when the sample is generated from the density $f$.

There are a number of similarities between an independence
chain and the corresponding importance sampling
process. While an independence chain does not require
explicit use of the weights, it will rarely accept candidates
with low weights. On the other hand, a candidate
with high weight will almost always be accepted. Furthermore,
when the chain reaches a point $x$ with high
weight $w(x)$, it will usually remain there for several iterations,
thus building up weight on $x$ within the sample
path by repetition. Another similarity to importance
sampling is that the sample sequence is closer to an i.i.d.
sequence from $\pi$ the closer the weight function $w$ is to a
constant.

Because  of  these  similarities  to  importance  sampling, 
it  is  reasonable  to  conjecture  that  guidelines  developed 
for  choosing  importance  sampling  densities  also  apply 
to  choosing  densities  for  driving  independence  chains. 
In particular, it is advisable to choose a density with
thicker tails than $\pi$ and thus a bounded weight function
$w$. Families like the split-$t$ that produce good importance
sampling densities are likely to be good choices for
independence chains.
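
An independence chain can be sketched along the same lines. The example below is illustrative only: the function name `independence_chain` is hypothetical, and the standard normal target with a Cauchy candidate density is an arbitrary pairing chosen because the Cauchy has thicker tails, so the weight function is bounded.

```python
import math
import random

def independence_chain(log_pi, sample_f, log_f, x0, n_steps, seed=0):
    """Metropolis chain whose candidates are i.i.d. draws from the density f."""
    rng = random.Random(seed)
    log_w = lambda x: log_pi(x) - log_f(x)       # log of w(x) = pi(x) / f(x)
    x, lw_x = x0, log_w(x0)
    path, n_accept = [], 0
    for _ in range(n_steps):
        y = sample_f(rng)                        # candidate does not depend on x
        lw_y = log_w(y)
        # alpha(x, y) = min{w(y) / w(x), 1}
        if math.log(1.0 - rng.random()) <= lw_y - lw_x:
            x, lw_x = y, lw_y
            n_accept += 1
        path.append(x)
    return path, n_accept / n_steps

# Illustrative target and candidate density (both are assumptions of this sketch).
log_normal = lambda x: -0.5 * x * x                          # pi up to a constant
log_cauchy = lambda x: -math.log(math.pi * (1.0 + x * x))    # f, thicker tails
draw_cauchy = lambda rng: math.tan(math.pi * (rng.random() - 0.5))
```

A call such as `independence_chain(log_normal, draw_cauchy, log_cauchy, 0.0, 20000)` returns the sample path together with the observed acceptance rate.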

2.1.3  Rejection  Sampling  Chains 

An interesting special case of an independence chain occurs
when the density $f$ is sampled using rejection sampling.
In attempting to use rejection sampling to sample
directly from $\pi$, we use a density $h$ and a constant $c$ such
that, hopefully, $\pi(x) \le c\,h(x)$. If we repeat the process of
sampling $Z$ from $h$ and then $U$ uniformly from $[0, c\,h(Z)]$,
until $U \le \pi(Z)$, then the final value of $Z$ has density

$$f(x) \propto \pi(x) \wedge c\,h(x).$$

If we do indeed have $\pi(x) \le c\,h(x)$, then $f$ is proportional
to $\pi$ and we obtain an i.i.d. sample from $\pi$. But it is
very difficult to insure that $c$ is large enough for $c\,h$ to
dominate $\pi$ without choosing $c$ excessively large, leading
to an inefficient algorithm with many rejections. And
even then without extensive analysis of the tails of $h$
and $\pi$ we cannot be certain that $c\,h$ does dominate $\pi$.

Fortunately, using this rejection scheme to drive an independence
Metropolis chain provides a simple remedy.
If we do have $\pi(x) \le c\,h(x)$ for all $x$, then the weight
function $w$ is a constant, no candidates are rejected, and
the rejection process produces an i.i.d. sequence from $\pi$
that is simply passed through the Metropolis algorithm
unchanged. But if $c\,h$ does not dominate $\pi$ for some $x$,
then, when the chain reaches such an $x$, the Metropolis
algorithm will occasionally reject candidate steps in
order to build up mass on this $x$ to make up for the
deficiency in the envelope $c\,h$. This introduces some dependence,
but insures that the equilibrium distribution
of the sample path is $\pi$ even if the envelope is deficient.
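
The remedy can be sketched by combining a rejection sampler with the independence chain acceptance rule. All names here are hypothetical, and the example deliberately uses an envelope that is too small (a Cauchy density times $c = 2$ against an unnormalized standard normal), so the weight $w(x) = \pi(x) / \{\pi(x) \wedge c\,h(x)\}$ exceeds one near the origin.

```python
import math
import random

def rejection_candidate(log_pi, draw_h, log_h, log_c, rng):
    """Draw Z from h until U <= pi(Z), with U uniform on (0, c h(Z)]."""
    while True:
        z = draw_h(rng)
        log_u = math.log(1.0 - rng.random()) + log_c + log_h(z)
        if log_u <= log_pi(z):
            return z                    # density proportional to min{pi, c h}

def rejection_chain(log_pi, draw_h, log_h, c, x0, n_steps, seed=0):
    """Independence chain driven by rejection sampling candidates."""
    rng = random.Random(seed)
    log_c = math.log(c)
    # w(x) = pi(x) / min{pi(x), c h(x)} = max{1, pi(x) / (c h(x))}
    log_w = lambda x: max(0.0, log_pi(x) - log_c - log_h(x))
    x, lw_x = x0, log_w(x0)
    path = []
    for _ in range(n_steps):
        y = rejection_candidate(log_pi, draw_h, log_h, log_c, rng)
        lw_y = log_w(y)
        if math.log(1.0 - rng.random()) <= lw_y - lw_x:   # min{w(y) / w(x), 1}
            x, lw_x = y, lw_y
        path.append(x)
    return path

# Illustrative target and envelope density (assumptions of this sketch).
log_normal = lambda x: -0.5 * x * x
log_cauchy = lambda x: -math.log(math.pi * (1.0 + x * x))
draw_cauchy = lambda rng: math.tan(math.pi * (rng.random() - 0.5))
```

When the envelope does dominate, `log_w` is identically zero and every candidate is accepted, which matches the description of the i.i.d. sequence being passed through unchanged.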

2.2  Combining  Strategies 

The Gibbs sampler and the Metropolis algorithms described
above provide a number of Markov chain strategies.
In addition to choosing any one of these strategies
and using it in its pure form, it is possible to form hybrid
strategies.

Suppose $P_1, \ldots, P_m$ are Markov kernels with invariant
distribution $\pi$. Two simple ways of combining these
kernels are as a mixture or a cycle. In a mixture, probabilities
$\alpha_1, \ldots, \alpha_m$ are specified, and at each step one of
the kernels is selected according to these probabilities.
In a cycle, each kernel is used in turn, and when the last
one is used the cycle is restarted.

Both strategies can be used in several ways. For example,
a Gibbs sampler can be combined with occasional
steps from an independence chain in a mixture or a cycle
to "restart" the Gibbs sampler and thus reduce correlations
while preserving the equilibrium distribution. As
another example, suppose $\theta$ can be split into two components
$(\theta_1, \theta_2)$, and direct sampling from $\theta_1 \mid \theta_2$ is possible
but direct sampling from $\theta_2 \mid \theta_1$ is not possible. Such a
situation is considered by Zeger and Karim (1991). Then
"Gibbs steps" for $\theta_1 \mid \theta_2$ can be combined with Metropolis
steps for $\theta_2 \mid \theta_1$ in a mixture or a cycle.
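
Mixtures and cycles are easy to express when each kernel is a function that maps the current state to the next one. The sketch below is an assumption-laden illustration: the names are hypothetical, and the bivariate normal target with correlation 0.5 is a stand-in for a problem where $\theta_1 \mid \theta_2$ can be sampled directly while $\theta_2 \mid \theta_1$ is handled by a Metropolis step.

```python
import math
import random

def mixture(kernels, probs):
    """At each step pick one kernel according to probs and apply it."""
    def step(x, rng):
        kernel = rng.choices(kernels, weights=probs)[0]
        return kernel(x, rng)
    return step

def cycle(kernels):
    """One step applies every kernel once, in order."""
    def step(x, rng):
        for kernel in kernels:
            x = kernel(x, rng)
        return x
    return step

RHO = 0.5  # illustrative correlation of a standard bivariate normal target

def log_pi(t1, t2):
    return -(t1 * t1 - 2.0 * RHO * t1 * t2 + t2 * t2) / (2.0 * (1.0 - RHO * RHO))

def gibbs_theta1(x, rng):
    # direct draw from theta1 | theta2, which is N(RHO * theta2, 1 - RHO^2)
    return (rng.gauss(RHO * x[1], math.sqrt(1.0 - RHO * RHO)), x[1])

def metropolis_theta2(x, rng):
    # random walk Metropolis step on theta2 with theta1 held fixed
    y = (x[0], x[1] + rng.gauss(0.0, 1.0))
    if math.log(1.0 - rng.random()) <= log_pi(*y) - log_pi(*x):
        return y
    return x

def run(step, x0, n_steps, seed=0):
    rng, x, path = random.Random(seed), x0, []
    for _ in range(n_steps):
        x = step(x, rng)
        path.append(x)
    return path
```

For instance, `run(cycle([gibbs_theta1, metropolis_theta2]), (0.0, 0.0), 20000)` alternates the two kernels, while `mixture([gibbs_theta1, metropolis_theta2], [0.5, 0.5])` picks one of them at random at each step.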

3  Some  Theoretical  Results 

Whatever approach is used to produce a Markov chain
with invariant distribution $\pi$, before the chain can be
used with confidence to generate samples for examining
$\pi$ certain theoretical questions need to be addressed. Answers
to some of these questions can be obtained using
some recent developments in general state space Markov
chain theory as described, for example, in Nummelin
(1984). This section outlines this approach. A more
complete discussion is given in Tierney (1991).

3.1  Convergence 

The first question to be addressed is whether the invariant
distribution $\pi$ is also the equilibrium distribution for
the chain, i.e. whether the distribution of the chain after
$n$ iterations converges to $\pi$. In discrete state space
Markov chain theory, two conditions are needed: irreducibility
and aperiodicity. The same is true in general
state space theory. Periodicity for general state spaces
can be defined in much the same way as for discrete
spaces. The concept of irreducibility is a little more complicated,
since individual states are usually not hit with
positive probability. It is therefore necessary to speak of
irreducibility with respect to a measure. In the present
context, a natural choice for this measure is $\pi$ itself. We
will therefore say that a Markov chain is $\pi$-irreducible
if for every set $A$ with $\pi(A) > 0$ the probability of the
chain ever entering $A$ is positive for every starting point
$x$ of the chain.

Irreducibility and aperiodicity need to be verified for
each Markov chain. Some useful sufficient conditions are
available for certain Metropolis chains. For example, a
random walk chain is $\pi$-irreducible and aperiodic if the
increment density is positive on a neighborhood of the
origin and the density $\pi$ is positive on all of $\mathbb{R}^k$. An
independence chain is $\pi$-irreducible and aperiodic if the
candidate generation density $f$ is positive whenever the
density $\pi$ is positive.

If a chain with invariant distribution $\pi$ is $\pi$-irreducible and
aperiodic, then it can be shown that the chain must be
positive recurrent and that for $\pi$-almost all $x$,

$$\| P^n(x, \cdot) - \pi(\cdot) \| \to 0,$$

where $\|\cdot\|$ denotes the total variation distance and $P^n(x, \cdot)$
is the distribution after $n$ steps of the chain started at $x$.

If the chain is Harris recurrent, then this convergence
occurs for all $x$. The definition of Harris recurrence is
somewhat technical, but a simple sufficient condition is
available that is satisfied by all $\pi$-irreducible Metropolis
chains and essentially all $\pi$-irreducible Gibbs samplers.
A $\pi$-irreducible Markov chain with invariant
distribution $\pi$ is called ergodic if it is aperiodic and
positive Harris recurrent.

3.2 Rates of Convergence

Once we know that the distribution of a chain converges
to $\pi$, the next question is to determine the rate of convergence.
The theory presented in Nummelin (1984) provides
several classifications for rates of convergence of
ergodic chains:

Degree 2: If a chain is ergodic of degree 2, then
$n \| P^n(x, \cdot) - \pi(\cdot) \| \to 0$
for $\pi$-almost all $x$.

Geometric: An ergodic chain is geometrically ergodic
if $\| P^n(x, \cdot) - \pi(\cdot) \| \le M(x) r^n$ for some $r < 1$ and
some function $M$ with $\int M \, d\pi < \infty$.

Uniform: An ergodic chain is called uniformly ergodic
if $\| P^n(x, \cdot) - \pi(\cdot) \| \le M r^n$ for some $r < 1$ and some
constant $M$.

Uniform ergodicity is the strongest of these forms of
convergence and it is the easiest form to work with. A
necessary and sufficient condition for a chain with kernel
$P$ to be uniformly ergodic is that there exist a probability
$\nu$, a constant $\beta > 0$ and an integer $n \ge 1$ such that
$\beta \nu(A) \le P^n(x, A)$ for all $A$ and $x$. Using this condition, it
is possible to derive a variety of sufficient conditions for
uniform ergodicity. For example, if $\mu(E) < \infty$ and the
densities $q$ and $\pi$ are bounded and bounded away from
zero, then the corresponding Metropolis kernel is uniformly
ergodic. As another example, an independence
Metropolis kernel is uniformly ergodic if the weight function
$w(x)$ is bounded.

This condition can also be used to derive conditions
for uniform ergodicity of hybrid kernels in terms of conditions
on the component kernels. For mixtures the condition
is particularly simple: if $P$ is uniformly ergodic,
then any mixture using $P$ with positive probability is
uniformly ergodic. For cycles a slightly more complicated
condition appears to be needed: if $P$ is used in
a cycle and there exists a probability $\nu$ and a constant
$\beta > 0$ such that $\beta \nu(A) \le P(x, A)$ for all $A$ and $x$, then the
cycle is uniformly ergodic. This condition is satisfied if $P$
is an independence kernel with a bounded weight function.
Combining such a kernel in a mixture or a cycle
with any other kernel, such as a Gibbs kernel, therefore
insures that the hybrid chain is uniformly ergodic. This
provides theoretical support for using occasional independence
"restart" steps together with a Gibbs sampler
to improve the properties of the sampler.

3.3  Limiting  Behavior  of  Averages 

In Markov chain methods, sample path averages are used
to estimate expectations under the distribution $\pi$. A law
of large numbers and a central limit theorem insure that
these estimates converge at reasonable rates. The law
of large numbers follows from the ergodic theorem and
needs no conditions other than existence of the expectation
under $\pi$:

Law of Large Numbers. If $P$ is ergodic with invariant
distribution $\pi$, and $\pi|f| < \infty$, then for any initial
distribution

$$\bar{f}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i) \to \pi f = \int f(x)\,\pi(dx)$$

almost surely.

A central limit theorem does appear to require some
conditions on the rate of convergence of the chain:

Central Limit Theorem. If $P$ is ergodic of degree
2 with $\pi P = \pi$, and $f$ is bounded, then for any initial
distribution the distribution of

$$\sqrt{n}\,(\bar{f}_n - \pi f)$$

converges weakly to a normal distribution with mean zero
and variance $\sigma^2(f)$.

Weaker but more complicated sufficient conditions are
available. Expressions for the asymptotic variance $\sigma^2(f)$
are available for finite $E$ (Peskun, 1973; Kemeny and
Snell, 1976). Other expressions involving certain hitting
times are available for general state spaces (Nummelin,
1984). These expressions do not appear to be useful for
computing the asymptotic variance.

4 Using a Markov Chain

Once a Markov chain strategy with satisfactory theoretical
properties has been selected, it can be used to estimate
numerical characteristics or to provide graphical
views of features of the posterior distribution.

4.1  Numerical  Uses 

Using Markov chains for calculating numerical characteristics
of a posterior distribution is in principle straightforward:
expectations with respect to $\pi$ can be approximated
by sample path averages. There are, however,
a number of issues that need to be considered before
running a chain.

4.1.1  Choosing  a  Sampling  Plan 

The first issue concerns the choice of a sampling plan.
There are two extreme approaches. Several authors have
proposed that Markov chains should be used to generate
$n$ independent realizations from the posterior by using
$n$ separate runs, each of length $m$, and retaining the final
states from each chain. The run length $m$ is to be
chosen large enough to insure that the chain has reached
equilibrium. An alternate approach is to use a single
long run, or perhaps a small number of long runs. Experience
and theoretical assessments in the simulation
literature appear to favor the use of long runs (Bratley
et al., 1987, Section 3.1.1; Kelton and Law, 1984). The
major drawback of using short runs is that it is virtually
impossible to tell when a run is long enough based on
such runs. Even using long runs, determining how much
of the initial series is affected by the starting state is very
difficult, but some literature on the subject is available
(Ripley, 1987, Section 6.1). A second drawback of short
runs is that they make inefficient use of the data: only $n$
out of a total of $nm$ data points are used. With a single
run of length $nm$ it is possible to use all the data, after
possibly discarding a small initial fraction.

A complication that does arise from the dependence
in using a single series is that variances of estimates are
harder to assess. Again the simulation literature offers
several alternatives, such as the use of batch means and
time series analysis (Bratley et al., 1987, Chapter 3; Ripley,
1987, Chapter 6). For some purposes it may nevertheless
be useful to have an approximately independent
sample from the posterior. Using long runs this can
be achieved by retaining every $r$-th point of a sample
path. The number $r$ of points to skip in order to produce
approximate independence can usually be chosen
much smaller than the number $m$ of steps needed to
reach approximate equilibrium, since small amounts of
correlation are usually much less serious than biases in
estimates of means.
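
Thinning a sample path and estimating a standard error by batch means can both be sketched briefly. The function names `thin` and `batch_means_se` are hypothetical, and the batch means estimate shown is the simplest non-overlapping variant, chosen here for illustration rather than taken from the references cited above.

```python
import math

def thin(path, r):
    """Keep every r-th point of a sample path (points r, 2r, 3r, ...)."""
    return path[r - 1::r]

def batch_means_se(values, n_batches):
    """Standard error of the sample mean estimated from batch means."""
    size = len(values) // n_batches          # any leftover points are dropped
    means = [sum(values[i * size:(i + 1) * size]) / size for i in range(n_batches)]
    grand = sum(means) / n_batches
    var_of_means = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return math.sqrt(var_of_means / n_batches)
```

The batches need to be long enough for their means to be roughly uncorrelated; otherwise the estimate will understate the true standard error.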

4.1.2  Determining  the  Run  Length 

Another consideration is to determine the total sample
size or run length required for accurate estimates. For
an i.i.d. sample of size $n$, the standard deviation of the
sample mean of a function $f(\theta)$ is $\sigma/\sqrt{n}$, where $\sigma$ is the
posterior standard deviation of $f(\theta)$. If a preliminary
estimate of $\sigma$ is available, perhaps from an asymptotic
analysis, then this can be used to estimate the sample
size that would be required in i.i.d. sampling. In dependent
sampling, observations are generally positively
correlated and a larger sample size will be required. If the
series is modeled as a first order autoregressive process,
then for large $n$ the standard deviation of the sample mean is
approximately

$$\frac{\sigma}{\sqrt{n}} \sqrt{\frac{1 + \rho}{1 - \rho}},$$

where again $\sigma$ is the posterior standard deviation of $f(\theta)$
and $\rho$ is the lag-one autocorrelation of the series $f(X_n)$. A rough
estimate of $\rho$ can thus be used to adjust the sample size
for dependence in the series.
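
The adjustment for dependence amounts to inflating the i.i.d. sample size by the factor $(1 + \rho)/(1 - \rho)$. A small sketch, with hypothetical function names and a simple moment estimate of the lag-one autocorrelation as an assumed way of obtaining the rough estimate of $\rho$:

```python
import math

def lag1_autocorrelation(values):
    """Moment estimate of the lag-one autocorrelation of a series."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    return sum(a * b for a, b in zip(dev, dev[1:])) / sum(d * d for d in dev)

def ar1_std_error(sigma, rho, n):
    """Approximate standard deviation of the sample mean under an AR(1) model."""
    return sigma / math.sqrt(n) * math.sqrt((1.0 + rho) / (1.0 - rho))

def required_run_length(sigma, rho, target_se):
    """Run length needed to reach target_se, adjusted for AR(1) dependence."""
    n_iid = (sigma / target_se) ** 2         # sample size under i.i.d. sampling
    return math.ceil(n_iid * (1.0 + rho) / (1.0 - rho))
```

With $\rho = 0.5$, for example, the run has to be three times as long as an i.i.d. sample giving the same accuracy.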

Instead of determining a fixed sample size in advance,
it is also possible to use sequential or batch sequential
rules for determining when to stop sampling. Since prior
information on the values of the posterior mean and
standard deviation is often available from initial analyses,
Bayesian sequential methods are a natural choice.
Batching can be used to insure that an assumption of
normality for batched means is reasonable.

One sequential approach that should be avoided is to
plot successive sample means and stop sampling when
the means appear to have converged. Since sample
means change by increments on the order of $O(n^{-1})$ but
errors are of order $O(n^{-1/2})$, this approach will produce
sample sizes that are too small. The presence of positive
correlations in Markov chain series makes these series
appear to have converged even earlier, even though the
correlations imply that errors are larger and thus larger
sample sizes are required than with i.i.d. sampling.

4.1.3  Numerical  Issues 

Some consideration of numerical stability is needed in
using any sampling based method. Expressions used to
evaluate log posterior densities obtained by translating
mathematical formulas into a computer language are often
reasonably stable near the posterior mode but not
far away from the posterior mode. This can lead to overflows
or, on IEEE hardware, results that are NaNs or
Infs. One way to avoid these problems is to carefully
study the formulas for evaluating the log posterior density
and modify them to be numerically stable even for
extreme parameter values. The effort required to do this
can be considerable. An expedient alternative that is often
effective is to truncate the parameter space to a reasonable
range that contains essentially all the posterior
probability and for which the posterior density formula
is numerically stable. This truncation also often insures
that a Markov chain used to sample from $\pi$ is uniformly
ergodic and thus improves the behavior of the Markov
chain estimates.

The need to allow truncation is an important consideration
in developing software for implementing sampling
based methods. Subroutines must allow for user supplied
range test functions or allow the results returned
by the log posterior subroutine to indicate a parameter
that is outside of the range.
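
One way such a range test might be wired in is to wrap the log posterior so that points outside a box return minus infinity and are therefore never accepted by a Metropolis step. The wrapper name `truncate` and the box-shaped range are assumptions of this sketch; other range tests could be substituted.

```python
import math

def truncate(log_post, lower, upper):
    """Wrap log_post so that points outside [lower, upper] get log density -inf."""
    def wrapped(theta):
        # the chained comparison is False for NaN, so NaN counts as out of range
        inside = all(lo <= t <= hi for t, lo, hi in zip(theta, lower, upper))
        return log_post(theta) if inside else -math.inf
    return wrapped
```

Because the original formula is never evaluated outside the range, expressions that would overflow or raise an error for extreme parameter values are avoided entirely.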

A numerical issue that is unique to Markov chain
methods is the possibility that rounding may introduce
absorbing states. If this happens, results obtained from a
Markov chain method may be meaningless. Again truncation
away from areas of the state space where such
rounding may occur can be helpful.

4.1.4 Variance Reduction

As with any simulation method, variance reduction techniques
can often significantly reduce the sample sizes required
for accurate estimates. Standard variance reduction
methods such as importance sampling, antithetic
variates, conditioning, and control variates (Bratley et
al., 1987, Chapter 2; Ripley, 1987, Chapter 5) can be
used with any Markov chain method.

Importance sampling can be used as a variance reduction
method by using a Markov chain with equilibrium
distribution $f$ instead of $\pi$ and then weighting sample
results with appropriate importance weights. Conditioning
is often useful in Gibbs samplers, since the assumptions
required for the Gibbs sampler imply that conditional
means or densities of one parameter given the rest
are usually available. Gelfand and Smith (1990) refer to
this use of conditioning as Rao-Blackwellization.

Antithetic variation can be introduced into a Markov
chain method by using a Metropolis step in which a candidate
step is obtained by reflecting the current state of
the chain through a point. If the posterior density is approximately
symmetric about this point, then the sample
will be also, and the resulting negative correlations
will reduce variances of estimates of linear functions of
$\theta$. This technique can also be used to take advantage
of approximate axial symmetries in a posterior distribution.
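
A reflection step of this kind can be sketched as a deterministic Metropolis proposal. The name `reflection_step` is hypothetical. Since reflecting twice returns the original point, the proposal is symmetric and the simple acceptance ratio $\pi(y)/\pi(x)$ applies; the step is not irreducible by itself, so it would be used together with other kernels in a mixture or a cycle.

```python
import math
import random

def reflection_step(log_pi, center):
    """Metropolis kernel whose candidate reflects the state through center."""
    def step(x, rng):
        y = [2.0 * c - xi for xi, c in zip(x, center)]   # reflect x through center
        # symmetric proposal, so alpha(x, y) = min{pi(y) / pi(x), 1}
        if math.log(1.0 - rng.random()) <= log_pi(y) - log_pi(x):
            return y
        return x
    return step
```

When the posterior is exactly symmetric about the chosen point, every reflection is accepted and successive states are mirror images of each other.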

One way to introduce control variates into a Markov
chain method is to use the sample path with importance
weights to calculate estimates of normal approximations
and to correct for the errors in these estimates.

4.1.5  Monitoring  Sampler  Performance 

In using Markov chain methods, it is important to monitor
the performance of the samplers to insure that they
are not exhibiting any unusual behavior. Gelfand and
Smith (1990) propose the use of quantile plots to monitor
performance. Monitoring sample paths of estimates
is also useful for this purpose, as is monitoring autocorrelations
of the parameters. Adaptive time series models
may also be useful for determining whether a series exhibits
any unusual features.

For Metropolis chains it is also important to keep track
of the number of candidates that are rejected. For
an independence chain, the proportion of rejections can
be related to the total variation distance between the
posterior density $\pi$ and the candidate generation density
$f$.

By monitoring the performance of a sampler, in particular
in the early stages, it is possible to experiment
with different settings for sampler parameters to obtain
samplers that are efficient for a particular problem. More
work is needed to find good strategies for making such
parameter adjustments.

4.2  Graphical  Uses 

Numerical summaries, such as posterior means, standard
deviations, marginal densities, and correlations, provide
insight into the uncertainty about one or perhaps two
features of $\theta$ at a time. For understanding uncertainty
in higher dimensions graphical methods may be more
useful than numerical summaries.

4.2.1  Plotting  Samples 

For three-dimensional quantities, one useful graphical
method available on microcomputers and workstations
with bitmapped displays is a rotatable three-dimensional
scatterplot. By selecting every $r$-th entry in a Markov
chain sample path we can obtain an approximate i.i.d.
sample from the posterior distribution and display this
sample in a rotatable scatterplot. Three-dimensional
structures will readily become apparent as the point
cloud of the sample is rotated.

Rotatable scatterplots are only useful for examining
three dimensions at a time. A method that may be useful
for higher dimensions is the Grand Tour. Again an
approximate i.i.d. sample can be selected and displayed
in a Grand Tour. Implementations of the Grand Tour
are only now becoming widely available, so extensive experience
with this method is not yet available. Early
results suggest that this method is reasonably effective
for detecting structures in four to six dimensions.

Figure 1: Posterior mean of a response function.

Figure 2: A second response function supported by the
posterior distribution.

4.2.2  Controlling  Animations 

If $\theta$ is more than five- or six-dimensional, then it may be
difficult enough to understand $\theta$ itself, much less uncertainty
about $\theta$. If a graphical view of $\theta$ is available that
is meaningful for particular values of $\theta$, then one way of
developing an understanding of the uncertainty about $\theta$
is to look at an animated version of the graph in which
$\theta$ is moved through a variety of values that are plausible
under the posterior distribution.

As an example, suppose we have a smooth response
function $\theta$ of a real variable $x$ in some interval $I$ that
is measured with error. Thus we obtain measurements of
the form

$$ Y = \theta(x) + \epsilon. $$

Our prior opinion on the function $\theta$ suggests that this
function is smooth, but does not suggest any particular
parametric structure.

Several approaches are available for specifying such a
prior distribution. Most involve choosing a prior on
coefficients in some representation, such as a power series
or spline. The coefficients of these representations are
not likely to be particularly meaningful. But a plot of
the response function $\theta$ over the interval $I$ is readily
understood. Figure 1 shows a plot of the posterior mean of
$\theta$ for a particular example. This mean exhibits a number
of features, such as a pronounced global minimum and
a secondary local minimum. Are these features really
present in $\theta$ or are they merely artifacts of the
posterior mean? One way to answer this question is to look
at other functions $\theta$ that are supported by the posterior
distribution. This can be done by running an animation
that shows graphs of different values of $\theta$.

To provide a good understanding of the posterior
distribution, an animation needs to visit all areas supported
by the posterior. In addition, to allow the user to keep
track of the changes in $\theta$ as it moves through the
posterior distribution, the animation has to move smoothly.
These objectives can be achieved using a random
walk-driven Metropolis chain with the posterior distribution
as its equilibrium distribution. Using the posterior as
the equilibrium ensures that the chain does eventually
approach all possible values of $\theta$ but spends most of its
time near values that are better supported by the
posterior distribution. The correlation in the random walk
ensures that the chain moves in small steps, thus providing
the visual continuity that is necessary for an effective
animation. Thus the correlations in the Metropolis chain
that are a nuisance for numerical computations are in
fact an advantage for this graphical application.
Continuity can be further enhanced by interpolating between
steps of the random walk.
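
The chain that drives such an animation can be sketched as follows.
This is a minimal illustration under stated assumptions, not the
implementation used for the figures: the function names, the Gaussian
random-walk proposal, the step size and the toy two-parameter
log posterior are all hypothetical choices made here.

```python
import math
import random

def metropolis_path(log_post, theta0, step, n_steps, rng):
    """Random-walk Metropolis: propose theta + N(0, step^2) noise per coordinate."""
    theta = list(theta0)
    lp = log_post(theta)
    path = [tuple(theta)]
    for _ in range(n_steps):
        proposal = [t + rng.gauss(0.0, step) for t in theta]
        lp_new = log_post(proposal)
        # Accept with probability min(1, post(proposal) / post(theta)).
        if math.log(1.0 - rng.random()) < lp_new - lp:
            theta, lp = proposal, lp_new
        path.append(tuple(theta))
    return path

def interpolate(path, k):
    """Insert k - 1 linearly interpolated frames between successive states."""
    frames = []
    for a, b in zip(path, path[1:]):
        for i in range(k):
            w = i / k
            frames.append(tuple((1 - w) * x + w * y for x, y in zip(a, b)))
    frames.append(path[-1])
    return frames

# Hypothetical posterior: independent standard normals on two coefficients.
def log_post(theta):
    return -0.5 * sum(t * t for t in theta)

rng = random.Random(1)
path = metropolis_path(log_post, [0.0, 0.0], step=0.2, n_steps=200, rng=rng)
frames = interpolate(path, k=4)  # each frame would be drawn as one curve
```

A small `step` keeps successive frames visually close, which is the
property the animation relies on, while the interpolation smooths the
jumps that remain whenever a proposal is accepted.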

Figure 2 shows another view of the animation. Viewing
the animation for this particular example for a few
minutes quickly reveals that the global minimum is quite
well defined but the shape of the left half of the curve is
very uncertain.

A useful enhancement for this animation is the bar
shown at the left of the two plots. The solid part of the
bar represents the probability content in the posterior
at or below the level of the current $\theta$, computed using
an approximation. This gives a quick indication of how
plausible the current view is.

Many variations on this animation are possible. For
example, using the posterior distribution as the
equilibrium of the driving Markov chain is a reasonable starting
point but is not essential. At times it may be useful to
force the chain to concentrate its motion closer to the
mode, or to move farther away from the mode and
possibly find interesting features that are farther away. This
can be accomplished by using a Markov chain with an
equilibrium density that is a power of the posterior
density, that is, by “cooling” or “heating” the posterior
distribution in the terminology of simulated annealing.
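
In symbols, and writing $\gamma$ for the power (a name introduced
here for illustration, not taken from the text), the driving chain
would have equilibrium density

```latex
\pi_\gamma(\theta) \;\propto\; \pi(\theta \mid y)^{\gamma}, \qquad \gamma > 0 .
```

Taking $\gamma > 1$ sharpens the density around the mode
("cooling"), $\gamma < 1$ flattens it ("heating"), and $\gamma = 1$
recovers the posterior itself. For a symmetric random-walk proposal
the Metropolis acceptance probability becomes
$\min\{1, [\pi(\theta' \mid y)/\pi(\theta \mid y)]^{\gamma}\}$, so
only the acceptance step of the chain has to change.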

Much additional work is needed to explore ways of
merging numerical methods such as the ones described in
this paper with new computing hardware that is now
becoming more widely available. The animation described
here is a first step in that direction.

Abstract

The  construction  and  implementation  of  a  Gibbs 
sampler  for  efficient  simulation  from  the  truncated 
multivariate  normal  and  Student-t  distributions  is  described. 
It  is  shown  how  the  accuracy  and  convergence  of  integrals 
based on the Gibbs sample may be constructed.

KEYWORDS:  Bayesian  inference;  Gibbs  sampler; 
Monte  Carlo;  multiple  integration;  truncated  normal 

1  Introduction 

The  generation  of  random  samples  from  a  truncated 
multivariate  normal  distribution,  that  is,  a  multivariate 
normal  distribution  subject  to  multiple  linear  inequality 
restrictions,  is  a  recurring  problem  in  the  evaluation  of 
integrals  by  Monte  Carlo  methods  in  econometrics  and 
statistics.  Sampling  from  a  truncated  multivariate  Student-t 
distribution  is  a  closely  related  problem.  The  problem  is 
central  to  Bayesian  inference,  where  a  leading  example  is  the 
normal  linear  regression  model  subject  to  linear  inequality 
restrictions  on  the  coefficients  (Geweke,  1986).  But  it  also 
arises  in  classical  inference,  when  integrals  enter  the 
likelihood  function;  McFadden  (1989)  has  proposed  the  use 
of  Monte  Carlo  integration  in  one  such  instance. 

Recently  several  promising  solutions  of  this  problem 
have  been  investigated.  A  survey  of  methods  along  with 
several  contributions  is  provided  in  Hajivassiliou  and 
McFadden  (1990).  One  of  these  methods,  the  Gibbs 
sampler,  is  especially  well  suited  to  the  problem.  It  uses  a 
simple  algorithm  that  generates  samples  with  great 
computational  efficiency,  but  at  the  cost  of  introducing  two 
complications.  First,  the  drawings  are  not  independent, 
which  complicates  the  evaluation  of  the  accuracy  of  the 
approximation  using  standard  methods  like  those  proposed 
in  Geweke  (1989).  Second,  the  distribution  from  which  the 
drawings are made converges to the truncated multivariate
normal distribution, but is not identical with it at any
stage.

In  this  paper  we  contribute  to  the  resolution  of  these 
problems.  The  ability  to  generate  variates  from  a  truncated 
univariate  normal  distribution  is  a  central  building  block  in 
the  solution  of  the  more  general  problem.  Section  2 
describes  an  algorithm  for  the  generation  of  variates  from  a 
truncated  univariate  normal  distribution  that  is  substantially 
more  efficient  and  flexible  than  the  method  that  has  been 
favored in the past. Drawing from the truncated multivariate
normal  distribution  with  the  Gibbs  sampler  is  fully 
developed  in  Section  3,  including  the  evaluation  of  the 
accuracy  of  the  numerical  approximation  and  construction  of 
diagnostics  for  convergence.  These  methods  are  extended  to 
the  multivariate  Student-t  distribution  in  Section  4. 

Throughout the paper some standard but not universal
notation is employed. The univariate normal probability
density function is $\phi(\cdot)$, the corresponding cumulative
distribution function is $\Phi(\cdot)$, and the inverse cumulative
distribution function is $\Phi^{-1}(\cdot)$. The uniform distribution
on the interval [a, b] is denoted U[a, b]. The univariate
truncated normal distribution TN(a, b) is the univariate
normal restricted to (a, b): its density is
$[\Phi(b) - \Phi(a)]^{-1} \phi(x)$ on (a, b) and 0 elsewhere;
$a = -\infty$ and $b = +\infty$ are permitted special cases.

2  The  mixed  rejection  algorithm 
for  truncated  univariate 
normal  sampling 

All of the methods described in this paper assume the
ability to draw i.i.d. samples from a truncated univariate
normal distribution. It is well recognized that rejection
sampling from a univariate normal distribution is
impractical. Inverse c.d.f. sampling (Devroye, 1986) is a
feasible alternative. If x ~ TN(a, b), then $x = \Phi^{-1}(u)$,
$u \sim U[\Phi(a), \Phi(b)]$. This method requires the evaluation of
one integral for each draw, and if the values of a and b change
with the draws, then three evaluations are required. The
computation of $\Phi^{-1}(w)$ requires more time as $w \to 0$ or
$w \to 1$, and the double precision implementation in the
IMSL/STAT library is unable to compute $w = \Phi^{-1}(p)$ if
$|w| > 8$. Here, we shall suggest a different algorithm, whose
execution times are substantially smaller than inverse c.d.f.
sampling, and can draw x ~ TN(a, b) for any a < b so
long as $|a| \le 35$ and $|b| \le 35$, when programmed in double
precision (64-bit) floating point arithmetic.
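
For reference, inverse c.d.f. sampling can be sketched in Python with
the standard library's `statistics.NormalDist`. This illustrates the
baseline method only and is not the Fortran/IMSL code timed in the
paper; the function name and the clamping constants are assumptions
made here.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal: provides cdf and inv_cdf

def rtn_inverse_cdf(a, b, rng):
    """Draw x ~ TN(a, b) as x = Phi^{-1}(u), u ~ U[Phi(a), Phi(b)]."""
    lo, hi = _STD.cdf(a), _STD.cdf(b)
    u = lo + (hi - lo) * rng.random()
    # inv_cdf requires 0 < u < 1, so clamp away from the endpoints.
    u = min(max(u, 1e-300), 1.0 - 1e-16)
    return _STD.inv_cdf(u)

rng = random.Random(0)
xs = [rtn_inverse_cdf(-0.5, 1.0, rng) for _ in range(20000)]
```

The sketch also shows why the method degrades in the tails: for a
lower limit such as a = 9 the value of $\Phi(a)$ rounds to 1 in
double precision, so the interval $[\Phi(a), \Phi(b)]$ collapses and
no useful draw can be produced.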

The algorithm produces i.i.d. samples from TN(a, b),
including the cases $a = -\infty$ and $b = +\infty$. It employs four
different kinds of rejection sampling, depending on the
values of a and b. In normal rejection sampling, x is
drawn from N(0, 1) and accepted if $x \in [a, b]$. In half-normal
rejection sampling, x is drawn from N(0, 1) and
|x| is accepted if $|x| \in [a, b]$ (where $a \ge 0$). In uniform
rejection sampling, x is drawn from U[a, b], u is drawn
independently from U(0, 1), and x is accepted if
$u \le \phi(x)/\phi(x^*)$, where $x^* = \arg\max_{[a,b]} \phi(x)$.
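
Of the three simple schemes, uniform rejection is perhaps the least
obvious, so a short Python sketch may help. The function name is
hypothetical, and the code assumes finite limits a < b.

```python
import math
import random

def rtn_uniform_rejection(a, b, rng):
    """Draw x ~ TN(a, b) for finite a < b by rejection from U[a, b]."""
    # x* maximizes the normal density on [a, b]: the point closest to zero.
    x_star = min(max(0.0, a), b)
    while True:
        x = a + (b - a) * rng.random()
        u = rng.random()
        # phi(x) / phi(x*) = exp((x*^2 - x^2) / 2) <= 1 on [a, b].
        if u <= math.exp(0.5 * (x_star * x_star - x * x)):
            return x

rng = random.Random(0)
xs = [rtn_uniform_rejection(0.5, 1.0, rng) for _ in range(20000)]
```

The acceptance rate is high when the density varies little over
[a, b], which is the situation in which the algorithm described below
selects this scheme.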

Exponential rejection sampling is key to the algorithm,
and requires description in some detail. The motivating
example is $TN(a, \infty)$, where a > 0, and possibly $\Phi(a)$ is
close to 1. As $a \to \infty$, the $TN(a, \infty)$ distribution comes
to resemble the exponential distribution as detailed in
Geweke (1986, Appendix A). Suppose z is drawn from an
exponential distribution on $(a, \infty)$ with kernel $\exp(-\lambda z)$
for z > a. Consider fixing $\lambda$ so as to minimize the
probability of rejection. The acceptance probability must be
proportional to $\exp(-\tfrac{1}{2} z^2)/\exp(-\lambda z)$, for $z \in [a, \infty)$.
Computing the constants of proportionality, we find
acceptance probabilities

$$ \exp[-\tfrac{1}{2}(z^2 - a^2) + \lambda (z - a)] \quad \text{if } \lambda \le a, $$
$$ \exp[-\tfrac{1}{2}(z - \lambda)^2] \quad \text{if } \lambda > a. $$

The first expression is maximized at $\lambda = a$ for all z.
Integrating the second expression with respect to the
exponential density $\lambda \exp[-\lambda (z - a)]$ on $[a, \infty)$, we find that
the acceptance probability is

$$ \lambda \exp(\lambda a - \tfrac{1}{2} \lambda^2) (2\pi)^{1/2} [1 - \Phi(a)]. $$

This is maximized when $\lambda = \tfrac{1}{2}[a + (a^2 + 4)^{1/2}]$. As
$a \to \infty$, $\lambda / a \to 1$, and the acceptance probability converges
to unity. Experimentation within the context of the
algorithm presently described has shown that the increase in
computing time from using the suboptimal but simpler
choice $\lambda = a$ is less than the time required to compute
$\tfrac{1}{2}[a + (a^2 + 4)^{1/2}]$. Hence we use $\lambda = a$ in this
algorithm.
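
A minimal Python sketch of exponential rejection sampling follows.
The function names and the optional `lam` argument are assumptions
introduced for illustration; with the default `lam = a` the
acceptance probability for a proposed z reduces to
$\exp[-\tfrac{1}{2}(z - a)^2]$.

```python
import math
import random

def rtn_exponential_rejection(a, rng, lam=None):
    """Draw x ~ TN(a, +inf) for a > 0 using a shifted exponential proposal."""
    if lam is None:
        lam = a  # the simpler, slightly suboptimal rate used in the text
    while True:
        z = a + rng.expovariate(lam)  # density lam * exp(-lam * (z - a)), z >= a
        if lam >= a:
            log_accept = -0.5 * (z - lam) ** 2
        else:
            log_accept = -0.5 * (z * z - a * a) + lam * (z - a)
        if math.log(1.0 - rng.random()) <= log_accept:
            return z

def optimal_rate(a):
    """Rate that maximizes the overall acceptance probability."""
    return 0.5 * (a + math.sqrt(a * a + 4.0))

rng = random.Random(0)
xs = [rtn_exponential_rejection(3.0, rng) for _ in range(20000)]
```

Passing `lam=optimal_rate(a)` uses the maximizing rate instead; as
the text notes, the gain in acceptance probability is small once a is
moderately large.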

The algorithm employs four constants ($t_i$, i = 1, ..., 4)
whose values have been set through experimentation with
computation time. The selected value is indicated when the
constant is introduced. The sampling procedure depends on
the relative configuration of a and b, as follows. Except
in case (1), a and b are finite.

(1) On $(a, \infty)$: normal rejection sampling if $a \le t_4$ (= .45);
exponential rejection sampling if $a > t_4$.

(2) On (a, b) if $0 \in [a, b]$:

(a) If $\phi(a) \le t_1$ (= .150) or $\phi(b) \le t_1$: normal
rejection sampling;

(b) If $\phi(a) > t_1$ and $\phi(b) > t_1$: uniform rejection
sampling.

(3) On (a, b) if a > 0:

(a) If $\phi(a)/\phi(b) \le t_2$ (= 2.18): uniform rejection
sampling;

(b) If $\phi(a)/\phi(b) > t_2$ and $a < t_3$ (= .725): half-normal
rejection sampling;

(c) If $\phi(a)/\phi(b) > t_2$ and $a \ge t_3$: exponential
rejection sampling.

The omitted cases $(-\infty, b)$, and (a, b) with b < 0, are
symmetric to the cases (1) and (3), respectively, and are
treated in the same way. Software for the mixed rejection
algorithm was tested by comparing the distributions of
sampled variates produced with those produced by inverse
c.d.f. sampling. Each was programmed in double precision
Fortran-77 using the IMSL/STAT library, on a Sun
Sparcstation 4/40 (IPC). Computation times for 10,000
sampled variates are shown in Table 1. Times for the
inverse c.d.f. algorithm range from 2.24 to 4.51 seconds,
those for the mixed rejection algorithm from 0.67 to 1.28
seconds. On a case-by-case basis the mixed rejection
algorithm is from 2.47 to 6.24 times faster than the inverse
c.d.f. algorithm.
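
The case analysis of this section can be summarized as a small
decision function. The Python sketch below only selects the kind of
rejection sampling; it is an illustration, not the Fortran routine
that was timed. The function names are hypothetical, and the reading
of the thresholds (in particular treating a equal to $t_3$ as the
exponential case) is an assumption.

```python
import math

T1, T2, T3, T4 = 0.150, 2.18, 0.725, 0.45  # constants quoted in the text

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def choose_method(a, b):
    """Return the kind of rejection sampling selected for TN(a, b)."""
    if not a < b:
        raise ValueError("need a < b")
    if b == math.inf:                      # case (1), also covers a = -inf
        return "normal" if a <= T4 else "exponential"
    if a == -math.inf:                     # mirror image of case (1)
        return choose_method(-b, math.inf)
    if a <= 0.0 <= b:                      # case (2)
        if phi(a) <= T1 or phi(b) <= T1:
            return "normal"
        return "uniform"
    if b < 0.0:                            # mirror image of case (3)
        return choose_method(-b, -a)
    # Case (3), a > 0: compare log[phi(a)/phi(b)] with log(t2) so that
    # very small densities cannot cause a division by zero.
    if 0.5 * (b * b - a * a) <= math.log(T2):
        return "uniform"
    return "half-normal" if a < T3 else "exponential"
```

In the mirrored cases a sampler would draw from the reflected
interval and return the negative of the draw.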

3  The Gibbs algorithm for
truncated multivariate normal
sampling

The central problem addressed in this paper is the
construction of samples from an n-variate normal
distribution subject to linear inequality restrictions,

$$ x \sim N(\mu, \Sigma), \quad a \le Dx \le b. \qquad (3.1) $$

The matrix D is n x n of rank n, individual
elements of a may be $-\infty$, and individual elements of b
may be $+\infty$. This accommodates fewer than n linearly
independent restrictions. It does not allow more than n
linearly independent restrictions, and the method set forth
here cannot be extended to these cases, at least in a tidy way.
In the applications described in the introduction the truncated
multivariate normal distribution arises in the form (3.1).
The problem is equivalent to the construction of samples
from the n-variate normal distribution subject to linear
restrictions,

$$ z \sim N(0, T), \quad \alpha \le z \le \beta, \qquad (3.2) $$

where

$$ T = D \Sigma D', \quad \alpha = a - D\mu, \quad \beta = b - D\mu, $$

and we then take $x = \mu + D^{-1} z$.
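
The equivalence of (3.1) and (3.2) can be checked in one line by
setting $z = D(x - \mu)$. The following is a worked restatement of
the definitions above, not an additional result:

```latex
z = D(x - \mu) \;\Rightarrow\;
E[z] = 0, \quad \operatorname{var}(z) = D \Sigma D' = T, \qquad
a \le Dx \le b \;\Longleftrightarrow\; a - D\mu \le z \le b - D\mu .
```

Inverting the map gives $x = \mu + D^{-1} z$, which is where the
assumption that D has rank n is used.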

Several approaches to the solution are possible; see
Hajivassiliou and McFadden (1990, Appendix B) for a brief
survey of these methods, and Hajivassiliou, McFadden, and
Ruud (1990) for an application of importance sampling to
the special case of orthant restrictions. Naive rejection
sampling from $N(\mu, \Sigma)$ can be employed directly in (3.1),
but is impractical in general since the ratio of rejected to
accepted variates is astronomical for many commonly
arising problems. More sophisticated procedures must cope
with the fact that the marginal distributions of the elements
of z, and of x, are not univariate truncated normal. The
method set forth here exploits the fact that the distribution
of each element of z, conditional on all of the other
elements of z, is truncated normal. This method has also
been described by Hajivassiliou and McFadden (1990), but as
outlined in the introduction we pursue several extensions
here.

Table 1

Comparison of Computation Times
Mixed Rejection and Inverse c.d.f. Algorithms
TN(a, b) Distribution*

(Columns give the lower limit a, rows the upper limit b.)

    b  Method    -8.0  -5.0  -3.0  -2.0  -1.0  -0.5   0.0   0.5   1.0   2.0   3.0   5.0

 -5.0  Mixed     1.02
       Inverse   4.52
 -3.0  Mixed     1.04  1.04
       Inverse   4.45  4.45
 -2.0  Mixed     1.07  1.10  1.08
       Inverse   4.44  4.43  4.49
 -1.0  Mixed     1.28  1.26  1.22  1.21
       Inverse   3.65  3.67  3.66  3.62
 -0.5  Mixed     1.19  1.19  1.25  1.26   .93
       Inverse   3.57  3.69  3.60  3.55  2.90
  0.0  Mixed     1.16  1.16  1.15  1.19   .75   .71
       Inverse   2.91  2.91  2.91  2.89  2.25  2.33
  0.5  Mixed      .89   .90   .90   .91   .78   .73   .67
       Inverse   3.55  3.54  3.58  3.61  2.92  2.92  2.24
  1.0  Mixed      .76   .76   .78   .79   .82   .79   .73   .89
       Inverse   3.54  3.52  3.53  3.51  2.89  2.90  2.24  2.90
  2.0  Mixed      .68   .69   .70   .70   .97  1.02  1.23  1.21  1.18
       Inverse   4.15  4.16  4.16  4.14  3.56  3.52  2.90  3.56  3.63
  3.0  Mixed      .71   .69   .69   .69   .90  1.01  1.19  1.18  1.18  1.05
       Inverse   4.24  4.18  4.19  4.18  3.57  3.65  2.94  3.75  3.71  4.51
  5.0  Mixed      .69   .68   .69   .69   .89   .98  1.21  1.15  1.23  1.05  1.02
       Inverse   4.22  4.15  4.17  4.15  3.52  3.54  2.92  3.58  3.64  4.43  4.45
  8.0  Mixed      .67   .68   .69   .69   .89  1.01  1.17  1.15  1.24  1.04  1.01   .97
       Inverse   4.18  4.15  4.17  4.13  3.53  3.54  1.91  3.56  3.63  4.42  4.45  4.47

*  Times are given in seconds, for drawing samples of size 10,000. Computations were performed on a Sun 4/40 (IPC)
workstation. Software was written in double precision Fortran-77, and used the IMSL/STAT Edition 10 routines DRNNOF
for univariate normal generation, DNORDF for evaluation of the univariate normal c.d.f., and DNORIN for evaluation of the
univariate normal inverse c.d.f.


The algorithm employed is the Gibbs sampler, whose
systematic application to problems of this form dates from
Geman and Geman (1984); see also Gelfand and Smith
(1990). The general problem is to sample from a
multivariate density f(x) for an n-dimensional random vector
x, when no practical algorithm is available for doing so
directly. But suppose that the conditional distributions,

$$ x_i \mid \{x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\} \sim
   f_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \quad (i = 1, \ldots, n) $$

are known, and are of a form that synthetic i.i.d. random
variables can be generated readily and efficiently from each of
the $f_i(\cdot)$. Let $x^{(0)} = (x_1^{(0)}, \ldots, x_n^{(0)})'$ be an
arbitrary point in the support of f(x). Generate successive
synthetic random variables,

$$ x_i^{(1)} \mid \{x_1^{(1)}, \ldots, x_{i-1}^{(1)}, x_{i+1}^{(0)}, \ldots, x_n^{(0)}\} \sim
   f_i(x_1^{(1)}, \ldots, x_{i-1}^{(1)}, x_{i+1}^{(0)}, \ldots, x_n^{(0)})
   \quad (i = 1, \ldots, n). \qquad (3.3) $$

These n steps constitute the first pass of the Gibbs
sampler. The second and successive passes are performed
similarly. At the i'th step of the j'th pass,

$$ x_i^{(j)} \mid \{x_1^{(j)}, \ldots, x_{i-1}^{(j)}, x_{i+1}^{(j-1)}, \ldots, x_n^{(j-1)}\} \sim
   f_i(x_1^{(j)}, \ldots, x_{i-1}^{(j)}, x_{i+1}^{(j-1)}, \ldots, x_n^{(j-1)}), $$

and the composition of the vector becomes

$$ (x_1^{(j)}, \ldots, x_i^{(j)}, x_{i+1}^{(j-1)}, \ldots, x_n^{(j-1)})' $$

at the end of this step. At the end of the j'th pass the
composition of the vector is

$$ x^{(j)} = (x_1^{(j)}, \ldots, x_n^{(j)})'. $$

Gelfand and Smith (1990) have outlined weak
conditions under which $x^{(j)}$ converges in distribution and
has limiting distribution given by the density f(x), and the
rate of convergence is geometric in the $L_1$ norm. These
conditions pertain to the truncated multivariate normal
density in (3.2). The conditional densities $f_i(\cdot)$ for this
problem are truncated univariate normal, and the algorithm
described in the previous section may be used to generate the
required successive synthetic random variables. Suppose
that in the non-truncated distribution N(0, T),

$$ E[z_i \mid z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n] = \sum_{j \ne i} c_{ij} z_j. $$

Then in the truncated normal distribution of (3.2), the
distribution of $z_i$ conditional on
$\{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n\}$ has the construction

$$ z_i = \sum_{j \ne i} c_{ij} z_j + h_i \epsilon_i, \qquad
   \epsilon_i \sim TN\Big[\big(\alpha_i - \sum_{j \ne i} c_{ij} z_j\big)/h_i,\;
                          \big(\beta_i - \sum_{j \ne i} c_{ij} z_j\big)/h_i\Big]. $$

Denote the vectors of coefficients in the conditional means,

$$ c^i = (c_{i1}, \ldots, c_{i,i-1}, c_{i,i+1}, \ldots, c_{in}) \quad (i = 1, \ldots, n). $$

From the conventional theory for the conditional
multivariate normal distribution (Rao, 1965, p. 441) and
expressions for the inverse of a partitioned symmetric matrix
(Rao, 1965, p. 29),

$$ c^i = -(T^{ii})^{-1} T^{i,<i}, \qquad h_i^2 = (T^{ii})^{-1}, $$

where $T^{ii}$ is the element in row i and column i of $T^{-1}$ and
$T^{i,<i}$ is row i of $T^{-1}$ with $T^{ii}$ deleted. These
computations need only be performed once, before sampling
begins. An initial value $z^{(0)}$ may be selected by setting
z = 0 and then successively applying (3.3) for i = 1, ..., n.
At the end of each pass we compute $x^{(j)} = \mu + D^{-1} z^{(j)}$.
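
The passes just described can be sketched in pure Python for the
standardized problem (3.2). This is an illustration under stated
assumptions rather than the paper's Fortran code: all names are
hypothetical, the truncated univariate draws use inverse c.d.f.
sampling for brevity where the paper uses its mixed rejection
algorithm, and the small Gauss-Jordan inverse is only suitable for
well-conditioned matrices of modest size.

```python
import random
from statistics import NormalDist

STD = NormalDist()

def invert(m):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def rtn(lo, hi, rng):
    """TN(lo, hi) draw by inverse c.d.f.; any univariate sampler could be used."""
    p_lo, p_hi = STD.cdf(lo), STD.cdf(hi)
    u = p_lo + (p_hi - p_lo) * rng.random()
    return STD.inv_cdf(min(max(u, 1e-300), 1.0 - 1e-16))

def gibbs_tmvn(T, alpha, beta, passes, rng):
    """Gibbs passes for z ~ N(0, T) restricted to alpha <= z <= beta."""
    n = len(T)
    Q = invert(T)                                    # Q[i][i] plays the role of T^{ii}
    h = [(1.0 / Q[i][i]) ** 0.5 for i in range(n)]   # h_i^2 = (T^{ii})^{-1}
    c = [[-Q[i][j] / Q[i][i] for j in range(n)] for i in range(n)]
    z = [0.0] * n
    draws = []
    for _ in range(passes + 1):        # the first pass only builds z^(0)
        for i in range(n):
            m = sum(c[i][j] * z[j] for j in range(n) if j != i)
            eps = rtn((alpha[i] - m) / h[i], (beta[i] - m) / h[i], rng)
            # Clamp to guard against rounding at the truncation points.
            z[i] = min(max(m + h[i] * eps, alpha[i]), beta[i])
        draws.append(tuple(z))
    return draws[1:]

rng = random.Random(3)
draws = gibbs_tmvn([[1.0, 0.5], [0.5, 1.0]], [0.0, 0.0], [float("inf"), 1.0], 500, rng)
```

Each tuple in `draws` is one $z^{(j)}$; applying $x = \mu + D^{-1} z$
to it would give the corresponding draw in the original coordinates.
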

Samples from the truncated multivariate normal
distribution are typically used to estimate the expected value
of a function $g(\cdot)$ of the random vector x,

$$ \bar{g} = \int g(x) f(x) \, dx. $$

An assessment of the reliability of this estimate must take
into account the facts that in general $\{x^{(j)}\}$ is a serially
correlated process, whose unconditional distribution
converges to $f(\cdot)$ rather than being identical with $f(\cdot)$.
These problems are taken up for the general case in Geweke
(1991), where standard spectral analytic techniques are used
to produce diagnostics for the convergence of the sampled
distributions to $f(\cdot)$ and to provide a numerical standard
error for the reliability of the estimated expected value.
Using this approach, five statistics from the sample
$\{x^{(j)}\}_{j=1}^{p}$ provide information about the expected value of
the function in question.

(1) The simple arithmetic mean
$\bar{g}_p = p^{-1} \sum_{j=1}^{p} g(x^{(j)})$ is
the most efficient estimate of $\bar{g}$ from a Gibbs sample of
size p passes, assuming that departures from
convergence in the sample are negligible. The function
g(x) is computed at the end of each pass, following the
transformation from z to x.

(2) The sampling variance of $\bar{g}_p$ is $S_g(0)/p$, where
$S_g(\omega)$ denotes the spectral density of the Gibbs-sampled
g(x) process at frequency $\omega$. The numerical
standard error (NSE) of $\bar{g}_p$ is $[p^{-1} \hat{S}_g(0)]^{1/2}$, where
$\hat{S}_g(\omega)$ is a consistent (in p) estimator of $S_g(\omega)$.

(3) The variance of $g(\cdot)$ is estimated in the same way as
the mean of $g(\cdot)$. The ratio of this variance to $S_g(0)$
indicates the ratio of the number of i.i.d. draws that
would have been required, were such an algorithm
available, to the number of passes required with the
Gibbs sampler, to produce an estimate of $\bar{g}$ of
equivalent reliability. Following Geweke (1989), this
ratio is called the relative numerical efficiency (RNE) of
the Gibbs sampling procedure.

(4) A convergence diagnostic (CD) is computed based on
subsamples of the sampled g(x); see Geweke (1991,
Section 3.2) for details. Under a stationary distribution
for $\{x^{(j)}\}_{j=1}^{p}$, this statistic has a standard normal
distribution.

(5) The spectral density provides further details on the
characteristics of the process $\{g(x^{(j)})\}_{j=1}^{p}$. If the
spectral density is nearly flat, or is lower near $\omega = 0$
than at other frequencies, then the Gibbs sampling
process is efficient relative to i.i.d. sampling. But if
the spectral density is much higher near $\omega = 0$ than
elsewhere, the process is inefficient. Thus, there is a
correspondence between the shape of the spectral density
and the RNE of the Gibbs sampler.
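
Statistics (1) to (3) can be illustrated with a short Python sketch.
Note the substitution: $S_g(0)$ is estimated here by the method of
batch means, a simpler technique than the spectral estimator of
Geweke (1991), and the function name, the number of batches and the
two synthetic test series are assumptions made for illustration.

```python
import random
from statistics import fmean, variance

def nse_rne(g_values, n_batches=100):
    """Mean, numerical standard error and relative numerical efficiency.

    S_g(0) is estimated by batch means (an assumption of this sketch),
    not by the spectral estimator used in the paper.
    """
    p = len(g_values) - len(g_values) % n_batches    # drop the ragged tail
    size = p // n_batches
    g = g_values[:p]
    batch_means = [fmean(g[k * size:(k + 1) * size]) for k in range(n_batches)]
    s0 = size * variance(batch_means)     # estimate of S_g(0)
    nse = (s0 / p) ** 0.5
    rne = variance(g) / s0
    return fmean(g), nse, rne

# Two synthetic series: i.i.d. draws and a first order autoregression.
rng = random.Random(2)
iid = [rng.gauss(0.0, 1.0) for _ in range(10000)]
ar1, x = [], 0.0
for _ in range(10000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    ar1.append(x)

mean_iid, nse_iid, rne_iid = nse_rne(iid)
mean_ar1, nse_ar1, rne_ar1 = nse_rne(ar1)
```

For the i.i.d. series the RNE should be close to one, while for an
autoregression with parameter 0.9 the theoretical value is
(1 - 0.9)/(1 + 0.9), about 0.05, so roughly twenty passes are worth
one independent draw.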

The  Gibbs  algorithm  was  programmed  in  double 
precision  Fortran-77  using  the  IMSL/STAT  library,  on  a 
Sun  Sparcstation  4/40  (IPC).  The  routine  was  tested  by 
comparing  the  distribution  of  truncated  normal  samples  with 
those  generated  by  a  naive  accept/reject  procedure.  To 
provide  some  indication  of  the  efficiency  of  the  procedure, 
we  present  two  examples  here. 

The first example is a truncated bivariate normal, with
parameters chosen so that convergence ought to be
especially slow. Both variables have mean zero. The
variance of $x_1$ is 10, while the variance of $x_2$ is 0.1, and
the restrictions are of the form $a_1 \le x_1 + x_2 \le b_1$,
$a_2 \le x_1 - x_2 \le b_2$. Consequently the transformed variables $z_1$
and $z_2$ have correlation .98. As elaborated in Geweke
(1991), this implies that the process $z^{(j)}$, and hence $x^{(j)}$,
will exhibit strong positive serial correlation: e.g., if
$a_i = -\infty$ and $b_i = +\infty$, then each element of $z^{(j)}$ will
follow a first order autoregressive process with parameter
.96. Results are presented in Table 2, which shows the five
statistics for five different configurations of truncation
points $(a_i, b_i)$, and for three choices of the number of
passes, p = 400, 2000, or 10,000. In each case p
preliminary passes were performed before the functions of
interest $g_i(x)$ (i = 1, ..., 4) were computed and averaged
over the next p passes. Computation times varied about
20% depending on the $(a_i, b_i)$ configurations, averaging
about .35 seconds for p = 400 and 7.1 seconds for p =
10,000.

The results, presented in Table 2, confirm that
convergence is slow for the untruncated normal distribution,
panel A. (This is presented as a limiting case; obviously
Gibbs sampling is not the method of choice for this
problem.) Even when p = 10,000, results are unreliable,
as indicated by the convergence diagnostics. The problem
arises from the strong serial correlation in the processes
$\{g(x^{(j)})\}$, which is not fully evident in the estimated
spectral densities for the smaller values of p;
correspondingly, computed RNE falls as p increases.
These results persist in the second case, in which $z_2$ is
truncated at about 1.5 standard deviations above and below
(Panel B), but are not so strong. In both cases the
convergence diagnostic is an imperfect indicator of unreliable
estimates of $\bar{g}$, for there are several cases in which $\bar{g}_p$ is
more than three times NSE from 0 (the known true value of
$\bar{g}$ in all cases except D) and yet CD is less than 1.5 in
absolute value. In the third case $z_2$ is truncated at about
.15 standard deviations above and below (Panel C), and
performance is satisfactory for all values of p. The same is
true in the fourth case, in which the bivariate normal
distribution is truncated to an extreme tail in both
dimensions (Panel D), and in the fifth case, in which the
truncation produces a distribution closer to uniform than to
bivariate normal (Panel E). Severe truncation diminishes
the potential for strong serial correlation in the $x^{(j)}$, and
thereby increases the efficiency of the Gibbs sampler.

The second example is constructed to resemble the
truncated multivariate normal distribution that might be
encountered in Bayesian inference with a multivariate probit
model with panel data and serial correlation in equation
disturbances for the same sampling unit and different years.
Assuming three choices, five years, and a first-order
autoregressive process for the disturbance leads to a variance
matrix $\Sigma = R \otimes I_3$, $r_{ij} = \rho^{|i-j|}$, in a 15-variate normal with
truncation restrictions that require one of $x_{3j+1}$, $x_{3j+2}$, and
$x_{3j+3}$ to be greater than the other two, for j = 1, ..., 5.
Results are presented in Table 3 for four different values of
$\rho$, ranging from $\rho$ = .00 to $\rho$ = .95. The number of passes
and preliminary passes are the same as those in the previous
example, and computation times range from 2.5 seconds for
p = 400 to about 60 seconds for p = 10,000. As $\rho$
increases, serial correlation in $z^{(j)}$ and hence $x^{(j)}$
increases, diminishing the efficiency and reliability of the
Gibbs sampling algorithm. The convergence diagnostic
proves to be a reliable indicator of the reliability of the
estimates $\bar{g}_p$. For $\rho$ = .00, 400 passes are reliable, despite
some modest serial correlation; for $\rho$ = .50, 2000 passes are
required, and for $\rho$ = .80, 10,000 passes are required. For
$\rho$ = .95, even 10,000 passes do not produce reliable results.

4  The  Gibbs  algorithm  for 
truncated  multivariate 
Student-t  sampling 

A closely related problem arising in Bayesian inference
is the generation of samples from the multivariate Student-t
distribution subject to linear restrictions,

$$ x \sim T(\mu, \Sigma; m), \quad a \le Dx \le b. $$

We continue to make the same assumptions about a, b, and
D. The genesis of the multivariate Student-t as the ratio of
a multidimensional normal to an independent $[\chi^2(m)/m]^{1/2}$
leads immediately to a Gibbs sampling algorithm for
$(w, z_1, \ldots, z_n)$ followed by the construction
$x = \mu + D^{-1} z w^{-1}$.

At the start of pass j, $w^{(j-1)}$ and $z^{(j-1)}$ are available
from the previous pass. In the first step draw
$w^{(j)} \sim [\chi^2(m)/m]^{1/2}$ subject to the restrictions

$$ \alpha_i w^{(j)} \le z_i^{(j-1)} \le \beta_i w^{(j)} \quad (i = 1, \ldots, n), $$

using an accept/reject procedure. In steps 2, ..., n+1,
draw the elements of z from the multivariate normal
distribution conditional on the pertinent z's and the
restrictions $\alpha_i w^{(j)} \le z_i^{(j)} \le \beta_i w^{(j)}$:

$$ z_i^{(j)} = \sum_{k=1}^{i-1} c_{ik} z_k^{(j)} + \sum_{k=i+1}^{n} c_{ik} z_k^{(j-1)} + h_i \epsilon_i, $$

$$ \epsilon_i \sim TN\Big[\big(\alpha_i w^{(j)} - \sum_{k=1}^{i-1} c_{ik} z_k^{(j)} - \sum_{k=i+1}^{n} c_{ik} z_k^{(j-1)}\big)/h_i,\;
   \big(\beta_i w^{(j)} - \sum_{k=1}^{i-1} c_{ik} z_k^{(j)} - \sum_{k=i+1}^{n} c_{ik} z_k^{(j-1)}\big)/h_i\Big]. $$

At the end of the pass, $x^{(j)} = \mu + D^{-1} z^{(j)} / w^{(j)}$.
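
The first step of each pass, the accept/reject draw of w, can be
sketched as follows. The function name, the `max_tries` guard and the
example values are assumptions made for illustration; the chi-square
variate is generated as a gamma variate with shape m/2 and scale 2.

```python
import math
import random

def draw_w(z, alpha, beta, m, rng, max_tries=100000):
    """Accept/reject draw of w ~ [chi2(m)/m]^(1/2) with alpha_i*w <= z_i <= beta_i*w."""
    for _ in range(max_tries):
        # chi-square(m) is Gamma(shape=m/2, scale=2).
        w = math.sqrt(rng.gammavariate(0.5 * m, 2.0) / m)
        if all(lo * w <= zi <= hi * w for zi, lo, hi in zip(z, alpha, beta)):
            return w
    raise RuntimeError("no acceptable w found")

# Hypothetical state from the previous pass and standardized limits.
rng = random.Random(4)
z_prev = [0.5, -0.2]
alpha = [0.0, -1.0]
beta = [1.0, float("inf")]
ws = [draw_w(z_prev, alpha, beta, m=5, rng=rng) for _ in range(100)]
```

In this example the restrictions reduce to w >= 0.5, so most
proposals are accepted, in line with the remark below that the
accept/reject step appeared efficient in practice.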

This algorithm was programmed in double precision
Fortran-77 using the IMSL/STAT library. The routine was
tested by comparing the distribution of truncated Student-t
samples with those generated by a naive accept/reject
procedure. No appreciable increases in computation time
over corresponding problems with the truncated multivariate
normal distribution were noted. In particular, the
accept/reject procedure for $w^{(j)}$ appears quite efficient, even
for m = 2. No considerations with respect to the efficiency
of the Gibbs sampling algorithm, beyond those for the
multivariate normal, have been noted.

Table 2

Properties of the Gibbs Sampler for a Truncated Bivariate Normal Distribution

$\mu_1 = \mu_2 = 0$, $\sigma_{11} = 0.1$, $\sigma_{22} = 10$, $\sigma_{12} = 0$

A: $-\infty < x_1 + x_2 < \infty$, $-\infty < x_1 - x_2 < \infty$

                    $g_1(x) = x_1$             $g_2(x) = x_2$          $g_3(x) = x_1 + x_2$       $g_4(x) = x_1 - x_2$
p                  400    2,000   10,000      400    2,000   10,000      400    2,000   10,000      400    2,000   10,000
Mean            -.0093   -.0030   -.0048    .6207    .5295    .4648    .6113    .5265    .4600   -.6301   -.5325   -.4695
NSE              .0172    .0068    .0027    .4515    .2461    .1516    .4522    .2463    .1518    .4514    .2462    .1515
RNE               .932    1.077    1.312     .170     .085     .042     .171     .085     .043     .170     .086     .043
CD             -12.539     .669   -2.319  -12.550     .669   -2.319  -12.539     .715   -2.342   12.453    -.620    2.293
$S_g(0)$         .1164    .0926    .0733    80.31   120.37   229.24    80.57   120.48   229.74    80.28   120.44   228.90
$S_g(\pi/2)$     .0941    .1104    .0999    .3610    .3651    .3002    .4556    .4491    .3911    .4548    .5018    .4092

B: $-\infty < x_1 + x_2 < \infty$, $-5 < x_1 - x_2 < 5$

                    $g_1(x) = x_1$             $g_2(x) = x_2$          $g_3(x) = x_1 + x_2$       $g_4(x) = x_1 - x_2$
p                  400    2,000   10,000      400    2,000   10,000      400    2,000   10,000      400    2,000   10,000
Mean             .0071    .0094   -.0016    .9331    .5474   .03667    .9402    .5568    .0351   -.9261   -.5380   -.0383
NSE              .0139    .0065    .0028    .2921    .1737    .1079    .2947    .1749    .1087    .2901    .1728    .1072
RNE              1.105    1.197    1.242     .177     .091     .051     .180     .093     .052     .179     .092     .052
CD               -.808     .927   -2.288    -.111    2.605    -.959    -.177    2.653   -1.041     .044   -2.544     .875
$S_g(0)$         .0761    .0083    .0079   33.607   59.950  116.023   34.220   60.743  117.705   33.148   59.321  114.499
$S_g(\pi/2)$     .0886    .1097    .1019    .3438    .3619    .2894    .4561    .5057    .4164    .4087    .4376    .3663

C: $-\infty < x_1 + x_2 < \infty$, $-.5 < x_1 - x_2 < .5$

                    $g_1(x) = x_1$             $g_2(x) = x_2$          $g_3(x) = x_1 + x_2$       $g_4(x) = x_1 - x_2$
p                  400    2,000   10,000      400    2,000   10,000      400    2,000   10,000      400    2,000   10,000
Mean             .0282   -.0018   -.0057    .0445   -.0080   -.0135    .0727   -.0098   -.0192   -.0162    .0061    .0078
NSE              .0150    .0065    .0032    .0272    .0113    .0056    .0405    .0169    .0085    .0170    .0073    .0035
RNE              1.060    1.125     .953     .638     .667     .573     .726     .782     .665     .796     .771     .688
CD               -.934     .251    -.963    -.793     .667     .573    -.902     .782     .665     .265     .341    -.352
$S_g(0)$         .0889    .0833    .1050    .2913    .2530    .3144    .6461    .5677    .7182    .1143    .1049    .1205
$S_g(\pi/2)$     .1012    .1007    .0994    .2126    .1487    .1763    .5304    .4176    .4702    .0097    .0081    .0081

D:  10  < 

XI  +  X2  < 

10  <  XI 

-X2  < 

OO 

1 - : 

gl(x)  = 

XI  1 

1 - 

g2(x)  = 

!  X2  - 1 

1 - g3(x)  =  XI 

+  X2  -1 

1  1 - g4(x)  =  XI 

-  X2  -1 

P 

400 

2,000 

10,000 

400 

2,000 

10,000 

400 

2,000 

10,000 

400 

2,000 

10,000 

Mean 

10.020 

10.019 

10.020 

.0003 

.0002 

.0001  10.020 

10.020  10.020  10.020 

10.019  10.020 

NSE 

.0007 

.0003 

.0001 

.0006 

.0003 

.0001 

.0010 

.0004 

.0002 

.0010 

.0005 

.0002 

RNE 

.839 

.802 

.910 

1.089 

1.091 

1.053 

.888 

1.039 

.941 

1.008 

.815 

1.015 

CD 

2.246 

.781 

-.255 

.184 

-.045 

-.001  1.656 

.550 

-.191 

1.855 

.577 

-.181 

Sg(0) 

.0002 

.0002 

.0002 

.0002 

.0002 

.0002 

.0004 

.0004 

.0004 

.0004 

.0004 

.0004 

Sg(n/2) 

.0002 

.0002 

.0002 

.0002 

.0002 

.0002 

.0004 

.0004 

.0004 

.0004 

.0004 

.0003 

E: 

-.5  < 

XI  +  X2  < 

.5, 

-.5  <  XI 

-X2  < 

.5 

1 - 

gl(x)  = 

XI  1 

1 - 

g2(x)  = 

:  X2  - 1 

1 - g3W  =  XI 

+  X2  -1 

1  1 - g4(x)  =  XI 

-  X2  -1 

P 

400 

2,000 

10,000 

400 

2,000 

10,000 

400 

2,000 

10,000 

400 

2,000 

10,000 

Mean 

-.0041 

.0001 

-.0011 

.0091 

.0072 

.0032 

.0050 

.0073 

.0020 

-.0132 

-.0072 

-.0043 

NSE 

.0073 

.0035 

.0018 

.0114 

.0052 

.0026 

.0142 

.0066 

.0032 

.0127 

.0059 

.0031 

RNE 

1.392 

1.338 

1.013 

.956 

.843 

.683 

.132 

.901 

.762 

1.221 

1.117 

1.363 

CD 

-.971 

-.754 

.249 

.745 

.898 

-1.526 

.132 

.276  - 

1.172  - 

1.264 

-1.229 

1.363 

Sg(0) 

.0209 

.0240 

.0310 

.0509 

.0538 

.0676 

.0804 

.0875 

.1025 

.0631 

.0680 

.0947 

Sg(n/2) 

.0351 

.0319 

.0291 

.0511 

.0381 

.0477 

.0893 

.0730 

.0809 

.0831 

.0672 

.0725 
Table 3

Properties of the Gibbs Sampler, Truncated 15-Variate Normal Distribution

μ = 0, Σ = R

Constraints: x1 > x2, x1 > x3; x5 > x4, ...

g1(x) = Σ_{i=1}^{5} x_{3i-2}, g2(x) = Σ_{i=1}^{5} x_{3i-1}, g3(x) = Σ_{i=1}^{5} x_{3i}

A: ρ = .00

Function  p        Mean      NSE     RNE     CD       Sg(0)    Sg(π/2)
g1        400      .4336     .0786   1.343   .750     2.4354   3.6108
g1        2,000    .3444     .0397   1.033   -.500    3.1338   2.9291
g1        10,000   .4263     .0181   .954    -.081    3.2663   3.2524
g2        400      .2821     .0984   .845    -.135    3.8149   3.2953
g2        2,000    .4637     .0383   1.072   -1.086   2.9267   3.0835
g2        10,000   .4011     .0176   1.015   -.310    3.0835   3.2071
g3        400      -.8200    .0917   .986    1.786    3.3118   3.2886
g3        2,000    -.8176    .0379   1.169   .074     2.8582   3.9327
g3        10,000   -.8478    .0186   .964    -.622    3.4432   3.2096

B: ρ = .50

Function  p        Mean      NSE     RNE     CD       Sg(0)    Sg(π/2)
g1        400      .4790     .1711   .382    -2.871   11.536   3.235
g1        2,000    .5861     .0847   .354    .541     14.273   3.079
g1        10,000   .5312     .0386   .341    1.403    14.861   3.176
g2        400      .4222     .1777   .397    -4.179   12.441   3.446
g2        2,000    .7204     .0806   .373    1.769    12.922   3.469
g2        10,000   .6776     .0382   .340    1.359    14.554   3.132
g3        400      -1.6333   .1833   .432    -2.694   13.244   3.990
g3        2,000    -1.0562   .0827   .402    .126     13.582   3.748
g3        10,000   -1.1207   .0415   .307    .842     17.165   3.554

C: ρ = .80

Function  p        Mean      NSE     RNE     CD       Sg(0)    Sg(π/2)
g1        400      .7862     .2363   .228    -1.709   22.007   1.652
g1        2,000    .2454     .1661   .126    -2.347   54.795   1.495
g1        10,000   .4784     .0830   .095    .564     68.716   1.425
g2        400      1.0146    .2376   .231    -1.613   22.238   1.448
g2        2,000    .4719     .1694   .123    -3.090   57.024   1.409
g2        10,000   .7426     .8232   .094    .101     67.556   1.351
g3        400      -.6402    .2248   .245    -2.316   19.913   1.781
g3        2,000    -1.1973   .1778   .120    -4.281   62.802   1.693
g3        10,000   -.9049    .0858   .094    .638     73.314   1.699

D: ρ = .95

Function  p        Mean      NSE     RNE     CD       Sg(0)    Sg(π/2)
g1        400      -1.0142   .2928   .176    4.001    33.78    .42
g1        2,000    1.0650    .2660   .081    2.166    140.06   .42
g1        10,000   -.1249    .1298   .042    2.280    167.97   .38
g2        400      -.8501    .2884   .175    3.957    32.78    .33
g2        2,000    1.2505    .2665   .081    1.988    141.11   .35
g2        10,000   .0587     .1295   .041    2.229    167.26   .34
g3        400      -1.7441   .2913   .177    3.231    33.44    .43
g3        2,000    .2082     .2661   .082    1.603    140.71   .41
g3        10,000   -.9101    .1311   .042    2.036    171.34   .42
ABSTRACT

Principal components analysis is already used in multispectral
image analysis to reduce the number of spectral
dimensions. We propose to use projection pursuit to find
interesting combinations of spectral variates that produce
images that enhance contrast differences between
differing land-use types. We develop a 3-dimensional
moment index based on Jones and Sibson's index for
projection into 2 dimensions.


1  Introduction. 

Remote sensing is an indispensable tool in many scientific
disciplines. It is one of the major tools in monitoring
our own environment in a cost-effective way. We
will be investigating methods of treating multispectral
images, which reduce the number of spectral dimensions,
without losing significant information. This information
extraction process has been performed in many ways in
the past. We develop the necessary techniques to perform
projection pursuit, which is to be used in a similar
role to principal components analysis.


2  The  Practical  Problem. 

The  NERC  Computer  Services  kindly  supplied  us  with 
much  thematic  mapper  data.  These  data  sets  consist  of 
images  collected  by  a  Daedalus  thematic  mapper,  flown 
in  an  aeroplane  above  the  area  to  be  remote  sensed.  The 
mapper  passively  senses  12  different  spectral  channels. 

A  monoimage  of  the  area  is  recorded  at  each  spectral 
frequency.  The  image  that  we  decided  to  use  was  one 
of  the  Chew  Valley  Lake,  Somerset,  UK.  We  decided  to 
use  this  image  since  it  has  a  good  mix  of  land  and  water 
features.  Each  monoimage  consists  of  1254x715  pixels, 
which  take  discrete  values  in  the  range  of  0  to  255.  We 
generally  operate  upon  sections  of  the  whole  image. 

Table  1  details  the  frequencies  that  the  scanner  de¬ 
tects. 


Channel   Frequency (μm)
1         0.42 - 0.45
2         0.45 - 0.52
3         0.52 - 0.60
4         0.605 - 0.625
5         0.63 - 0.69
6         0.695 - 0.75
7         0.76 - 0.90
8         0.91 - 1.05
9         1.55 - 1.75
10        2.08 - 2.35
11        8.50 - 13.00
12        8.50 - 13.00

Table 1: Spectral channels sensed by NERC Daedalus
thematic mapper.


2.1  Viewing  the  image. 

One  thing  we  would  want  to  do  with  this  image  is  look 
at  it.  We  could  view  12  separate  monoimages,  but  it  is 
useful  to  somehow  combine  the  images  to  form  a  colour 
image.  Colour  is  effective  for  highlighting  differences  in 
land  use  and  type,  and  directs  the  eye  to  various  fea¬ 
tures. 

We would generally view the image on a CRT monitor,
and would maybe later obtain a hardcopy. Most
colour monitors use the red-green-blue (RGB) system of
specifying colours (to span the 3D colour space that humans
perceive [1]), although this is not the only system
that we could use. One way to obtain a quick view of
the image is to choose three mapper bands and assign
each of them to one of the RGB colours.

The difficult question is: what mapper frequencies do
we use, and which colours do we assign them to? Note
also that, with $n$ bands, there are $n!/(n-3)!$ ways of
choosing such assignments (e.g. $12!/9! = 1320$). To view
all of them, and select good images, is at best non-objective,
and at worst, horrendously time-consuming.
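
The count of 1320 can be checked by enumerating the ordered choices directly. The short Python sketch below is purely illustrative; the variable names are not from the paper.

```python
from itertools import permutations

bands = range(1, 13)                        # the 12 mapper channels
assignments = list(permutations(bands, 3))  # ordered (R, G, B) band triples
print(len(assignments))                     # 12 * 11 * 10 = 1320
```

Each tuple is one way of driving the red, green and blue guns from three distinct channels, so even this modest sensor gives far too many candidate images to inspect by eye.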
2.2 Multivariate methods.

We wish to move on to more incisive techniques of variable
reduction. For these techniques, we wish to consider
the image as a multivariate data set. To do this we regard
spectral channels as variates, and pixels as cases.
We will let K represent the number of variates, and N
the number of cases (e.g. K = 12, N = 1254 x 715 = 896610).
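
Treating pixels as cases amounts to a reshape of the image array. The sketch below assumes, purely for illustration, that the image is held as a NumPy array of shape (rows, cols, K); the function names are hypothetical.

```python
import numpy as np

def image_to_cases(image):
    """Reshape a (rows, cols, K) multispectral image into an (N, K) data matrix."""
    rows, cols, k = image.shape
    return image.reshape(rows * cols, k)    # one row per pixel, one column per channel

def cases_to_image(cases, rows, cols):
    """Inverse of image_to_cases: put the N cases back on the pixel grid."""
    return cases.reshape(rows, cols, cases.shape[1])
```

Any linear combination of the K columns can then be computed on the (N, K) matrix and reshaped back to a monoimage for display.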

2.3  Why  dimension  reduction? 

To  end  this  section  we  mention  two  other  reasons  why 
dimension  reduction  is  a  useful  processing  step. 

It  is  very  common  to  run  an  automatic  classifier  over 
an  image.  Due  to  the  curse  of  dimensionality  (see  [4]) 
these  algorithms  can  become  confused,  and  work  much 
better  in  lower  dimensions. 

Secondly,  the  amount  of  remotely  sensed  data  col¬ 
lected  is  increasing  at  an  alarming  rate,  and  so  knowing 
what  to  keep  is  important. 

2.4  Data  quality. 

From monoimages we have found spectral channels 1 and
7 to be very noisy. Also, channel 12 records at the same
frequency as channel 11, except at a different gain level.
For these reasons we have discarded channels 1, 7 and 12
from the analysis, giving an effective set of nine variates.

3  Analysis  by  Principal  Components  Analysis. 

Principal components analysis is an established multivariate
technique already used for dimension reduction
in image analysis (where it is also known as decorrelation).
Full and detailed treatments of principal components
analysis can be found in most applied multivariate
texts (e.g. [6]).

We  compute  principal  components  from  the  correla¬ 
tion  matrix  of  the  image.  The  correlation  matrix  usually 
tells  us  that  channels  of  similar  frequency  are  highly  cor¬ 
related. 

Since  humans  perceive  a  3D  colour  space,  we  will  usu¬ 
ally  choose  the  3  principal  components  associated  with 
the  3  largest  eigenvalues. 
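
A minimal sketch of this step, assuming the image has already been arranged as an (N, K) NumPy array of cases by variates. The function name and the choice of `numpy.linalg.eigh` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def correlation_pca(cases, n_components=3):
    """Principal components from the correlation matrix of an (N, K) data matrix."""
    x = np.asarray(cases, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)       # standardise each channel
    corr = (z.T @ z) / z.shape[0]                  # K x K correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs[:, :n_components]         # component images, one per column
    return eigvals, eigvecs, scores
```

The three columns of `scores` can be reshaped to the pixel grid and assigned to display colours. A channel with zero variance would cause a division by zero here and would need to be dropped first.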

3.1  Results  of  principal  components  analysis. 

In  Table  2  we  display  a  typical  set  of  eigenvalues.  From 
this  one  can  see  that  the  first  3  principal  components 
account  for  over  90%  of  the  variation  inherent  in  the 
data  (so  maybe  3  dimensions  are  adequate).  The  first 
principal  component  in  our  example  is  typically  not  very 
far  from 


Number   Eigenvalue   % Variance Expl.
1        6.88         76
2        1.50         17
3        0.387        4.3
4        0.130        1.4
5        0.0569       0.63
6        0.0323       0.36
7        0.0138       0.15
8        0.00612      0.068
9        0.00149      0.017

Table 2: Eigenvalues from typical principal components
analysis.
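
The percentages in the table follow directly from the eigenvalues, whose sum is close to 9 because the analysis uses the correlation matrix of nine variates. A small illustrative check in Python (variable names are assumptions):

```python
eigenvalues = [6.88, 1.50, 0.387, 0.130, 0.0569, 0.0323, 0.0138, 0.00612, 0.00149]

total = sum(eigenvalues)                            # close to 9, the number of variates
percent = [100 * ev / total for ev in eigenvalues]  # % variance explained per component

cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)                      # cumulative % after each component
```

The third entry of `cumulative` is the share of variation captured by the first three components, which is the quantity behind the "over 90%" statement.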

In layman's terms, the first principal component appears
to be a roughly equal combination of all the original
spectral variates. This component has an intuitive interpretation
as a brightness variate and so we assign it
to the B of the hue-saturation-brightness (HSB) colour
model.

The remaining principal components are usually contrasts
of certain channels. On a rendered image this
has the effect of providing contrast enhancements.

4  Analysis  by  Projection  Pursuit. 

4.1  What  is  projection  pursuit? 

Exploratory  projection  pursuit  can  be  used  for  the  same 
purposes  as  principal  components  analysis.  We  wish  to 
use  the  cluster-detecting  ability  of  projection  pursuit, 
just  as  we  would  with  ordinary  multivariate  data. 

We do not wish to describe exploratory projection
pursuit in great detail. Interested readers should consult
[5] or [2] for more information.

4.2  Projection  pursuit  into  3  dimensions. 

Many projection indices have been proposed in the
literature [5], [2], [3]. None have yet been explicitly developed
for projection into 3 dimensions, although for some
it is a relatively trivial matter to do so. We also prefer
an index that is rotationally invariant with respect to
the chosen basis in the projection space.

However, the overriding consideration for us is computational
efficiency. All indices in the literature (that
we know of) have a computational effort of order N or
larger. One index that almost overcomes this barrier is
the moment index described in [5]. Once a set of summary
statistics is computed for a data set, the subsequent
computation of an optimal projection solution does not
depend on N. Since a common method of searching for
optimal projections depends on many random starting
positions, this independence of N is very useful, since
for these image problems N is usually very large.
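
The independence of N can be illustrated with a one-dimensional moment index. The sketch below is not the authors' code: it assumes the commonly quoted one-dimensional form of the Jones and Sibson index, $(\kappa_3^2 + \kappa_4^2/4)/12$, and uses plain sample moments rather than k-statistics. The moment tensors are computed once from the data; every later evaluation for a direction `a` touches only K-sized arrays.

```python
import numpy as np

def moment_summaries(cases):
    """One pass over the (N, K) data: central moment tensors of order 2, 3 and 4."""
    xc = cases - cases.mean(axis=0)
    n = xc.shape[0]
    m2 = np.einsum("ni,nj->ij", xc, xc) / n
    m3 = np.einsum("ni,nj,nk->ijk", xc, xc, xc) / n
    m4 = np.einsum("ni,nj,nk,nl->ijkl", xc, xc, xc, xc) / n
    return m2, m3, m4

def moment_index_1d(a, m2, m3, m4):
    """Moment index of the projection onto a, using only the stored tensors."""
    var = a @ m2 @ a
    mu3 = np.einsum("ijk,i,j,k->", m3, a, a, a)
    mu4 = np.einsum("ijkl,i,j,k,l->", m4, a, a, a, a)
    k3 = mu3 / var ** 1.5           # standardised third cumulant (skewness)
    k4 = mu4 / var ** 2 - 3.0       # standardised fourth cumulant (excess kurtosis)
    return (k3 ** 2 + k4 ** 2 / 4.0) / 12.0
```

With K = 9 variates the fourth-order tensor has only 9^4 = 6561 entries, so restarting an optimiser from many random directions costs nothing that depends on the roughly 900,000 pixels.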

4.3  Review  of  the  moment  index. 

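The moment index [5] is derived from the order-1 negative
Shannon entropy index. We have extended the
moment index to a 3D space and obtain

$$
\begin{aligned}
I = \frac{1}{12}\Big[\, & k_{300}^2 + 3k_{210}^2 + 3k_{201}^2 + 3k_{120}^2 + 6k_{111}^2 \\
 & + 3k_{102}^2 + k_{030}^2 + 3k_{021}^2 + 3k_{012}^2 + k_{003}^2 \\
 & + \tfrac{1}{4}\big( k_{400}^2 + 4k_{310}^2 + 4k_{301}^2 + 6k_{220}^2 + 12k_{211}^2 \\
 & \quad + 6k_{202}^2 + 4k_{130}^2 + 12k_{121}^2 + 12k_{112}^2 + 4k_{103}^2 \\
 & \quad + k_{040}^2 + 4k_{031}^2 + 6k_{022}^2 + 4k_{013}^2 + k_{004}^2 \big) \Big]
\end{aligned}
$$

as our 3D index, where $k_{rst}$ are trivariate k-statistics.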

This  projection  index  is  rotationally  invariant  with 
respect  to  choice  of  basis  in  the  projection  space,  and 
the  derivatives  with  respect  to  the  projection  space  can 
be  calculated. 

4.4  Sphered  images. 

Sphering(1) [7] is a transformation that transforms the
original data set into one that has zero mean and identity
variance.
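
One way to carry out such a transformation is sketched below, under the assumption that the data are an (N, K) NumPy array with a non-singular covariance matrix. The symmetric inverse square root used here is only one of several valid choices of sphering matrix, and the function name is hypothetical.

```python
import numpy as np

def sphere(cases):
    """Return data with zero mean and identity sample covariance."""
    x = np.asarray(cases, dtype=float)
    xc = x - x.mean(axis=0)                          # remove the mean
    cov = (xc.T @ xc) / xc.shape[0]                  # K x K sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T   # cov^(-1/2)
    return xc @ inv_sqrt
```

After this step all linear correlation between channels has been removed, so whatever structure is left in the sphered image cannot be explained by correlation alone.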

It  is  very  interesting  to  observe  the  results  of  the 
sphering  process  applied  to  the  image  data.  What  al¬ 
most  seems  like  a  ghost  picture  of  the  “original”  results. 
Certain  things  remain,  for  example,  edges  of  fields,  cer¬ 
tain  buildings,  indicative  of  jump  changes  in  intensity 
which  will  not  be  accounted  for  by  linear  correlation. 

4.5  Results  of  projection  pursuit. 

Once  we  have  a  3  dimensional  projection  solution  we 
still  have  to  decide  how  we  are  to  apply  the  solution  to 
the  RGB  guns  of  a  CRT.  Usually  the  projection  solution 
is  transformed  back  to  the  unsphered  space  of  variates, 
and  then  principal  components  is  applied  to  the  data  in 
this  space. 

Unlike principal components analysis, projection pursuit
finds no brightness component; this is probably due
to the action of sphering. Projection pursuit finds linear
combinations that it finds interesting.

The  moment  index  has  been  criticised  in  the  past  for 
rewarding  projections  which  contain  outliers.  We  use 
this  and  the  image’s  spatial  structure  to  our  advan¬ 
tage  to  find  prominent  outlier  features,  having  unique 
reflectance  properties. 

(1) also known as the Mahalanobis transformation [6]


Otherwise,  projection  pursuit  finds  interesting  con¬ 
trasts  of  the  original  variates,  which  are  usually  different 
from  those  found  using  principal  components.  Some¬ 
times,  one  finds  that  ground  structure  is  highlighted 
more  effectively  with  a  projection  pursuit  contrast  than 
a  principal  components  one. 

5  Conclusions. 

We  take  the  view  that  projection  pursuit  should  act  in 
a  complementary  role  to  principal  components  analysis. 
It has the potential to find interesting clusters and act
as a valuable dimension-reducer.

After  practical  experience  with  colour  images  and 
their  manipulation,  we  realise  how  dangerous  it  is  to 
compare  the  performance  of  various  methods  when  the 
output  is  a  colour  image.  Sometimes  changing  the  colour 
assignments  in  an  image  can  be  more  revealing  than 
changing  a  linear  combination  of  channels. 

However,  for  automatic  classifiers  and  storage  we  must 
be  able  to  reduce  dimension  effectively,  without  losing 
too  much,  and  projection  pursuit  will  be  useful  here. 

We must investigate the use of other colour models.
We have used RGB and HSB models here; there may be
others which might fit in more naturally.

We could also try other projection indices, or search
for projection spaces one dimension at a time.
1 Introduction

We  consider  the  following  problem:  a  black  and  white 
image  is  observed  in  digitized  form.  Unfortunately  the 
‘real’  image  is  not  observed:  at  some  stage  the  image  has 
been  distorted  with  noise.  Our  objective  is  to  remove  as 
much  of  the  noise  as  possible,  to  get  approximately  the 
original  image  back. 

In a more mathematical setting, let x be an m by n
array, with entries 0 and 1; x is considered to be a realization
of a random variable X. We do not observe the
image x. Instead we observe y, a noisy version of x that
is a realization of the random variable Y, where the distribution
of Y depends on x. We want to estimate x on
the basis of y.

The problem that we are discussing is a special case
of the more general image reconstruction problem, in which
y is a set of records generated by degradation of the true image
x. The noisy image y and the original image x may or
may not be closely related. Two of the most influential
papers discussing these problems are Besag (1986) and
Geman and Geman (1984). Since then a large body of
literature about image reconstruction has developed. See
Besag (1989) and Geman (1991) for a review of this area.
A common assumption is to put a Markov Random
Field as a prior on the images. Using a Markov Random
Field as prior on the images leads to a global optimization
problem to reconstruct the original image. There are
several algorithms to deal with this optimization problem,
for example, Gibbs sampling, simulated annealing and
ICM.

We will not assume a Markov Random Field in this
paper. Instead it is assumed that the probabilities of
observing a certain pattern in the image are the same
everywhere in the image. We will study the independent
Bernoulli noise case. For the algorithm which we will
discuss, the decision about the (i, j) pixel in the original
image is made based upon those pixels in the observed
image that are within a window around (i, j). Except for
the noise level ε, all statistics required to make a decision
about this pixel can be gathered, in an empirical Bayes
fashion, from the image.

This  article  is  a  summary  of  chapter  1  of  Kooperberg 
(1991). 

2  A  Bayes  Window  Estimator  for 
Binary  Images 

Some  definitions  and  notation: 

Let $S$ be a finite subset of $Z^2$ and let $B^S$ be the collection
of functions on the elements of $S$ that are 0-1 valued.

Let $A_{nm}$ be the set $\{(i, j) : 1 \le i \le m,\ 1 \le j \le n\}$. If
$S = A_{nm}$ we can think of $B^S$ as the collection of n x m
arrays with entries 0 and 1. We will write $B^{nm}$ instead of
$B^{A_{nm}}$. If there is no confusion we will omit the n and m
and we will write $B$ instead of $B^{nm}$. An element $x \in B$
can be written as $x = (x_{ij},\ 1 \le i \le m,\ 1 \le j \le n)$.

Let $(0, 0) \in S$. If $(i, j)$ and $S$ are such that $1 \le (i + k) \le m$
and $1 \le (j + l) \le n$ for all $(k, l) \in S$, we can define
a window-operator $W_{ij}$. Informally, $W_{ij}x$, $x \in B$, is that
part of $x$ that falls within $S$, when $S$ is positioned such
that the origin of $S$ is positioned at $(i, j)$.

Define a window-operator $W_{ij} : B^{nm} \to B^S$ as:

$$(W_{ij}x)_{kl} = x_{i+k,\,j+l}, \quad (k, l) \in S.$$

We also need to define the center-less window-operator
$E_{ij} : B^{nm} \to B^{S \setminus \{(0,0)\}}$ as:

$$(E_{ij}x)_{kl} = x_{i+k,\,j+l}, \quad (k, l) \in S,\ (k, l) \ne (0, 0).$$

Thus a window-operator cuts a piece of shape $S$ from
$x \in B^{nm}$, centered at $(i, j)$; a center-less window-operator
cuts out the same piece, except for the center pixel $(i, j)$.
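
The two operators can be sketched in a few lines of Python. The 3 x 3 window, the zero-based indexing and the function names are assumptions made for the illustration; the window S is stored as a list of offsets (k, l) with (0, 0) as its origin.

```python
import numpy as np

# A 3 x 3 window S, stored as offsets (k, l); (0, 0) is the centre.
S = [(k, l) for k in (-1, 0, 1) for l in (-1, 0, 1)]

def window(x, i, j, offsets=S):
    """W_ij x: the pixels x[i+k, j+l] for (k, l) in S, returned as a tuple."""
    return tuple(int(x[i + k, j + l]) for k, l in offsets)

def centerless_window(x, i, j, offsets=S):
    """E_ij x: the same pixels as W_ij x, with the centre pixel left out."""
    return tuple(int(x[i + k, j + l]) for k, l in offsets if (k, l) != (0, 0))
```

Returning tuples makes each window pattern hashable, which is convenient when patterns are later counted. The caller must keep (i, j) far enough from the border that all offsets stay inside the array, since negative NumPy indices would silently wrap around.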

Now let $x \in B$ be the image that we want to reconstruct.
$x$ is a realization of the random variable $X$. Instead
of $x$ we observe a realization of $Y$, $y$. The $Y_{ij}$'s are
conditionally independent given that $X = x$:

$$P(Y_{ij} = x_{ij} \mid X_{ij} = x_{ij}) = 1 - \varepsilon,$$

$$P(Y_{ij} = 1 - x_{ij} \mid X_{ij} = x_{ij}) = \varepsilon,$$

with $0 < \varepsilon < 0.5$.
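
This noise model is easy to simulate. The sketch below assumes the image is a NumPy array of 0s and 1s; the function name is illustrative.

```python
import numpy as np

def add_bernoulli_noise(x, eps, rng):
    """Flip each pixel of the 0-1 image x independently with probability eps."""
    flips = rng.random(x.shape) < eps     # True where a pixel is corrupted
    return np.where(flips, 1 - x, x)
```

For example, `add_bernoulli_noise(x, 0.25, np.random.default_rng(0))` produces an observed image in which roughly a quarter of the pixels disagree with `x`.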

Let $x \in B$, and let $T : B \to B$ be an estimator of $x$ based
upon $y$. Define

$$R(X, T) = E \sum_{i,j} |X_{ij} - T(Y)_{ij}| \,/\, (nm) = \sum_{i,j} P(X_{ij} \ne T(Y)_{ij}) \,/\, (nm)$$

to be the expected misclassification error of $T$ as estimator
of $X$.

We call $T_S$ a window-estimator if $T_S : B \to B$, and if
$T_S(Y)_{ij}$ is independent of $\{Y_{kl},\ (k - i, l - j) \notin S\}$ given
$W_{ij}Y$.

Fix a window $S \subset Z^2$. The following theorem holds:
Theorem: (i) The window-estimator $T_S$, $S$ fixed, which
minimizes the expected misclassification error $R(X, T_S)$
has the form:

$$T_S(y)_{ij} = \begin{cases} y_{ij} & \text{if } P(Y_{ij} = y_{ij} \mid E_{ij}Y = E_{ij}y) > 2\varepsilon(1 - \varepsilon), \\ 1 - y_{ij} & \text{otherwise;} \end{cases} \qquad (1)$$

(ii) The expected misclassification error achieved by this
estimator is:

$$R(X, T_S) = \cdots - \frac{1}{2(1 - 2\varepsilon)} \big[\, \cdots \,\big]$$

where $U = P(Y_{ij} = 1 - y_{ij} \mid E_{ij}Y = E_{ij}y)$.

See Kooperberg (1991) for the proof.

What does this mean? It says that one gets the Bayes
window estimator by the following procedure:

a) Cover the pixel (i, j) that you want to reconstruct.

b) Compute the probability that this pixel in the observed
image is white (or black), $P(Y_{ij} = 1 \mid E_{ij}Y = E_{ij}y)$.

c) If this probability makes you pretty sure (either
$P(Y_{ij} = 1 \mid E_{ij}Y = E_{ij}y) > 1 - 2\varepsilon(1 - \varepsilon)$ or
$P(Y_{ij} = 0 \mid E_{ij}Y = E_{ij}y) > 1 - 2\varepsilon(1 - \varepsilon)$), then
this is the Bayes estimate of the pixel in the original
image $x_{ij}$.

d) If there is still doubt, remove the cover, and the color
that you observed ($y_{ij}$) will be the estimate for the
original color ($x_{ij}$).


3 An Empirical Bayes Window
Estimator for Binary Images

To use the window estimator (1), information about the
distribution of $W_{ij}Y$ (or $W_{ij}X$; to compute the probabilities
in (b) above) and ε is needed. Although information
about the distribution of $W_{ij}Y$ will be seldom available,
we will assume that ε is (approximately) known.

To get information about the distribution of $W_{ij}Y$ we
can now take an empirical Bayes approach, and use the
data to estimate this distribution. If we assume that the
distribution of $W_{ij}Y$ (and thus the distribution of $W_{ij}X$)
does not depend on i and j (homogeneity), counting for
how many pixels (k, l) $W_{ij}y = W_{kl}y$ gives the empirical
estimator

$$\hat{P}(Y_{ij} = y_{ij} \mid E_{ij}Y = E_{ij}y) = \frac{\sum_{k,l} I(W_{ij}y = W_{kl}y)}{\sum_{k,l} I(E_{ij}y = E_{kl}y)}, \qquad (2)$$

where $I(\cdot)$ is the usual indicator function.

There is a problem with this estimator though. Clearly
we would like to have a large window to incorporate as
much information as possible in the decision. However
a large window might lead to very small counts in (2).
Even for a relatively small 5 x 5 window we would be
counting the empirical distribution on $2^{24} = 16,777,216$
points. Even to get an average of just 1 observation in
each cell we need a picture of 4096 by 4096 points!

One possible modification is to use a large window
whenever this is possible, but to use a smaller window
if the estimator in (2) would be based on very small
counts. For example, we could first use a window of size 13:

        •
      • • •
    • • ○ • •
      • • •
        •

and make a 60% confidence interval for
$P(Y_{ij} = y_{ij} \mid E_{ij}Y = E_{ij}y)$ based upon $\sum_{k,l} I(W_{ij}y = W_{kl}y)$
and $\sum_{k,l} I(E_{ij}y = E_{kl}y)$. If this interval does not
cover $2\varepsilon(1 - \varepsilon)$ we make a decision, while if it does cover
$2\varepsilon(1 - \varepsilon)$ we make the decision based upon a window of
size 9:

    • • •
    • ○ •
    • • •

Another modification is to assume left-right, top-bottom
and/or diagonal symmetry of the prior distribution
of $W_{ij}Y$. Each symmetry reduces the number of
different patterns by a factor of 2.

These two modifications are used in our examples.
Other possible modifications that we do not use include:
(i) assume black-white symmetry of the prior distribution
on $W_{ij}Y$; or (ii) a procedure in which we do not
only count those patterns that are exactly the same, but
also those that are almost the same, i.e. differ only in
one or two points. Those that are different would then,
conceivably, make a smaller contribution than those that
are exactly the same. The reason that we do not use the
latter idea is that we do not know of an algorithm to implement
this rule that would use less than $O((nm)^2)$ time,
while without this idea, we can implement the algorithm
in $O(nm \log(nm))$ time.
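
A bare-bones version of the empirical Bayes rule, with a single fixed square window and none of the modifications described above, might look as follows. This is a sketch under stated assumptions rather than the implementation used for the examples: the function name, the `radius` parameter, zero-based indexing and the decision to leave border pixels unchanged are all choices made for the illustration. Identical patterns are counted with a hash table, which keeps the cost close to linear in the number of pixels.

```python
from collections import Counter
import numpy as np

def empirical_bayes_window(y, eps, radius=1):
    """Reconstruct a 0-1 image from y using pattern counts and the 2*eps*(1-eps) rule."""
    m, n = y.shape
    offsets = [(k, l) for k in range(-radius, radius + 1)
               for l in range(-radius, radius + 1)]
    centre = offsets.index((0, 0))
    full, surround = Counter(), Counter()   # counts of W-patterns and E-patterns
    patterns = {}
    for i in range(radius, m - radius):
        for j in range(radius, n - radius):
            w = tuple(int(y[i + k, j + l]) for k, l in offsets)
            e = w[:centre] + w[centre + 1:]  # same pattern without the centre pixel
            patterns[(i, j)] = (w, e)
            full[w] += 1
            surround[e] += 1
    threshold = 2 * eps * (1 - eps)
    out = y.copy()                           # border pixels stay as observed
    for (i, j), (w, e) in patterns.items():
        p_hat = full[w] / surround[e]        # estimate of P(Y_ij = y_ij | surround)
        if p_hat <= threshold:
            out[i, j] = 1 - y[i, j]          # observed colour is implausible: flip it
    return out
```

With a 3 x 3 window there are only 2^8 surround patterns, so the counts are reasonably large even for small images; larger windows run into the sparsity problem discussed in the text.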


4  Examples 

We used the empirical Bayes window reconstruction rule,
as described in the previous section, on a number of initial
examples. Upon examination of the results, we concluded
that the estimator was working reasonably well, but that
it left too many small spots and was a bit too rough to
please our eye. This is actually not surprising: our estimator
did not make any assumptions about smoothness,
while in practice images do tend to be (somewhat)
smooth. We decided to carry out some post-processing
to further smooth the picture. We settled on the following
operation: change all black (white) pixels that, together
with at most 12 other black (white) pixels, are not
connected to any other black (white) pixels and are completely
surrounded by white (black) pixels.
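
This post-processing step amounts to flipping every connected group of at most 13 like-coloured pixels. The sketch below is one possible reading of it: the text does not say which notion of connectedness was used, so 4-connectivity is an assumption here, as are the function name and the treatment of groups that touch the image border (they are handled like any other group).

```python
from collections import deque
import numpy as np

def remove_small_components(img, max_size=13):
    """Flip every 4-connected group of equal pixels that has at most max_size members."""
    m, n = img.shape
    out = img.copy()
    seen = np.zeros((m, n), dtype=bool)
    for si in range(m):
        for sj in range(n):
            if seen[si, sj]:
                continue
            colour = img[si, sj]
            component = [(si, sj)]
            seen[si, sj] = True
            queue = deque(component)
            while queue:                               # breadth-first flood fill
                i, j = queue.popleft()
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    a, b = i + di, j + dj
                    if (0 <= a < m and 0 <= b < n
                            and not seen[a, b] and img[a, b] == colour):
                        seen[a, b] = True
                        queue.append((a, b))
                        component.append((a, b))
            if len(component) <= max_size:             # small isolated spot
                for i, j in component:
                    out[i, j] = 1 - colour
    return out
```

Components are identified on the input image and all flips are written to a copy, so the result does not depend on the order in which the spots are visited.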

We applied the algorithm to several other examples,
among them the same examples as were used in Greig,
Porteous and Seheult (1989). We added 25% Bernoulli
noise to their figure 1. Our reconstruction had 4.8% incorrectly
estimated pixels after post-processing (9.8% before
post-processing). The methods discussed in Greig
et al. (1989) (annealing, ICM and exact MAP) had between
5.2% and 5.4% incorrectly estimated pixels.

We also used the figure that first appeared as Figure 4 in Besag
(1986). This was an 88 x 100 hand-constructed scene,
designed specifically to contain some awkward features.
We applied our algorithm with 30% additive Bernoulli
noise. There were 6.5% incorrectly classified pixels after
post-processing. Greig et al. (1989) obtained between
5.4% and 7.0% incorrectly classified pixels using
several other reconstruction methods.

On the next page we show two larger examples. Typically,
for images with the amount of detail that these figures
have, 10% incorrect pixels in Y, the noisy image,
are reduced by our reconstruction method to about 1%
incorrect pixels in X̂, the reconstructed image. 20% errors
in Y are reduced to about 2-3% in X̂; 30% errors in
Y are reduced to about 4-8% in X̂; and 40% errors in Y are
reduced to about 15-30% in X̂.


5  Discussion 

We  have  introduced  a  reconstruction  rule  for  binary  im¬ 
ages.  The  rule  only  uses  the  information  within  a  finite 
window  centered  on  the  point  to  be  reconstructed.  The 
rule  is,  among  all  the  rules  based  on  that  window,  the  one 
that  minimizes  the  expected  number  of  incorrectly  recon¬ 
structed  pixels.  Surprisingly,  the  rule  can  be  expressed  in 
the  statistics  of  the  observed  image  only.  Therefore,  to 
use  the  rule,  we  do  not  need  to  know  the  prior  distribution 
of  the  images. 

If we assume stationarity we can apply the rule in an
empirical Bayes fashion. All the necessary parameters
can be estimated from the observed image. No training
images are required.

Our  examples  suggest  that  the  method  works  well  for 
binary  images  with  a  small  amount  of  noise  if  some  post¬ 
processing  is  applied.  In  these  cases  it  removes  almost 
all  the  noise.  In  binary  images  with  higher  noise  levels  or 
more  details  the  method  still  works  quite  well.  The  results 
are  comparable  to  those  achieved  by  some  other  methods 
in  the  literature.  We  should  point  out  though  that,  al¬ 
though  some  generalizations  are  possible,  our  method  is 
not  yet  applicable  to  such  a  wide  range  of  different  prob¬ 
lems  as  several  of  the  other  methods  are  applicable  to. 
Further  work  is  needed  to  explore  the  possibilities  to  ex¬ 
tend  the  window  based  method  to  other  problems. 
Figure 2 (right): kitty, top to bottom:

—  original  (630  x  390  pixels); 

—  with  20%  errors; 

—  reconstruction,  with  2%  errors. 
Abstract

In  this  paper,  we  use  simulations  and  some 
theory  to  show  that  many  of  the  standard 
techniques  used  to  estimate  band-to-band 
misregistrations  in  multivariate  imagery  are 
biased.  These  biases,  although  typically 
small,  become  important  because  they  are 
much  larger  than  the  standard  errors  of  the 
estimators  obtained  for  most  image  pairs, 
which  often  consist  of  sample  sizes  of  the 
order of $10^5$ or $10^6$. We develop an empirical
method for effectively correcting the
biases in some of the methods. Typically,
this enables us to estimate misregistrations
to within about 1/100th of a pixel.


1  Introduction 

In  recent  years,  quite  a  few  papers  have  been 
published  in  the  remote  sensing  and  image 
processing  literatures  on  methods  for  esti¬ 
mating  band-to-band  misregistrations  in  mul¬ 
tivariate  imagery.  Accurate  band-to-band 
registration  is  important  for  the  multivari¬ 
ate  analysis  of  such  imagery,  because  a  ba¬ 


sic  assumption  is  that  all  components  of  a 
vector  of  spectral  values  refer  to  the  same 
ground location. This is reflected in
specifications for various airborne and satellite
multispectral scanners, which typically
require that the bands be registered to within
0.1 or 0.2 pixels.

Most  of  the  methods  used  to  estimate  band- 
to-band  misregistrations  are  either  cross¬ 
covariance-based  or  Fourier-based.  In  this 
paper,  we  show  that,  when  these  methods 
are  applied  to  remotely  sensed  imagery,  both 
usually  give  biased  estimates  of  the  misreg¬ 
istrations,  the  former  because  of  inadequate 
interpolation  procedures  and  the  latter  be¬ 
cause  they  do  not  account  for  the  presence  of 
aliasing in the data. We describe a Fourier-based
method which accounts for aliasing
and which, for a variety of 512 x 512 image
pairs, gives misregistration estimates with
standard errors in both horizontal and vertical
directions of less than 1/100th of a pixel!
Because of space limitations, only an outline
of the work is given here. More extensive
descriptions can be found in Berman et
al (1990, 1992).

Much of the theory rests on essentially one-dimensional
ideas. Consequently, we deal
first, in Section 2, with one-dimensional images,
that is, time series data. This is extended
in Section 3 to two-dimensional images.


2  One-Dimensional  Images 
or  Time  Series 

Suppose we observe two time series $\{Y_j(t)\}$,
$j = 1, 2$, satisfying the relationships

$$Y_j(t) = \alpha_j + \beta_j S(t + L_j) + \epsilon_j(t), \qquad (2.1)$$

$t = 1, \ldots, N$, $j = 1, 2$, where $S(t)$ (the "signal")
and the $\epsilon_j(t)$ (the "noise") are assumed to be
weakly stationary processes that are mutually
uncorrelated with $E(\epsilon_j(t)) = 0$. The parameter
of interest is $D = L_2 - L_1$. Because
pixel values obtained from sensors such as
cameras are usually integrals of brightness
values over a region corresponding approximately
to the pixel, we can further assume
that approximately

$$S(t) = \int_{t-1}^{t} X(u)\,du, \qquad (2.2)$$

where $X(u)$ is itself a continuous weakly stationary
process.

A naive estimator of $D$, used widely in remote
sensing, is obtained by finding the maximum
of the cross-covariance function of the
two time series. If $\gamma_s(t) = \mathrm{cov}(S(u), S(u + t))$,
we see from (2.1) that
$\mathrm{cov}(Y_1(u), Y_2(u + t)) = \beta_1\beta_2\gamma_s(t + D)$,
which (assuming a unique maximum) is maximised when
$t = -D$. However, because the data are not continuously
observed, we can estimate $\gamma_s(t)$ directly
only for integer $t$. Hence, if $D$ is non-integer,
we need to interpolate our estimates
of $\gamma_s(t)$ in the vicinity of its maximum to
estimate it. The appropriate interpolator is
highly data-dependent. Using simulations,
we have found that this often leads to estimates
with a bias of about 0.1 of a pixel; see
Berman et al (1992, Section 3).

More sophisticated estimation procedures can
be based on the Fourier transform. Let

$$F_j(\omega_u) = (2\pi N)^{-1/2}\sum_{t=1}^{N} Y_j(t)e^{it\omega_u} \qquad (2.3)$$

($\omega_u = 2\pi u/N$, $u = 1, \ldots, [N/2]$) denote the
discrete Fourier transform of series $j$, and let
$\hat\theta(\omega_u)$ denote the phase difference between
the two series at frequency $\omega_u$ (note that
$\hat\theta(-\omega_u) = -\hat\theta(\omega_u)$). If either (a) $D = K/2$,
where $K$ is an integer, or (b) there is no
aliasing of the data, then it can be shown
that, in large samples, $\hat\theta(\omega_u) \approx D\omega_u + 2\pi m(\omega_u)$,
where $m(\omega_u)$ is that integer ensuring that
$\hat\theta(\omega_u) \in (-\pi, \pi]$. Note that, if $|D| < 1$,
which usually is the case with remotely sensed
data, $m(\omega_u) = 0$. Hamon and Hannan (1974,
Section 2) assert that (provided that there is
no aliasing) the asymptotically optimal estimator
of $D$ maximizes

$$\sum_{0<u<N/2} W(\omega_u)\cos(\hat\theta(\omega_u) - D\omega_u), \qquad (2.4)$$

where

$$W(\omega_u) = \sigma^2(\omega_u)/(1 - \sigma^2(\omega_u)) \qquad (2.5)$$

and $\sigma^2(\omega_u)$ is the coherence between the two
series. Since the coherence is usually unknown,
it needs to be estimated from the
data; see Hamon and Hannan (1974) for details.
Hannan and Thomson (1988) consider
the behaviour of (2.4) and other asymptotically
equivalent estimators under low signal-to-noise
scenarios. It is also worth noting (as
Hannan (1975) and Chan et al (1978) have)
that, when $|D| < 1$ and the noise is small,
maximising (2.4) is approximately equivalent
to minimising

$$\sum_{0<u<N/2} W(\omega_u)\{\hat\theta(\omega_u) - D\omega_u\}^2. \qquad (2.6)$$

Of  course,  the  advantage  of  (2.6)  is  that  it 
has  an  explicit  solution. 
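
Minimising a weighted sum of squares of the form $\sum W(\omega_u)\{\hat\theta(\omega_u) - D\omega_u\}^2$ over $D$ gives the closed form $\hat D = \sum W\omega_u\hat\theta(\omega_u) / \sum W\omega_u^2$. The sketch below illustrates this on a noise-free, band-limited, periodic test signal, so that no aliasing is present. It makes several assumptions that are not from the paper: NumPy's FFT sign convention, cross-spectrum magnitudes used as stand-in weights in place of the coherence-based weights, and the hypothetical names `phase_slope_delay` and `make_pair`.

```python
import numpy as np

def phase_slope_delay(y1, y2):
    """Estimate D, where y2(t) is approximately y1(t + D), by weighted
    least squares regression (through the origin) of phase on frequency."""
    n = len(y1)
    f1 = np.fft.rfft(y1 - np.mean(y1))
    f2 = np.fft.rfft(y2 - np.mean(y2))
    u = np.arange(1, (n - 1) // 2 + 1)      # frequencies with 0 < u < N/2
    omega = 2.0 * np.pi * u / n
    cross = f2[u] * np.conj(f1[u])
    theta = np.angle(cross)                 # phase difference in (-pi, pi]
    w = np.abs(cross)                       # assumed weights (not coherence-based)
    return np.sum(w * omega * theta) / np.sum(w * omega ** 2)

def make_pair(n=101, d=0.2, n_harmonics=20, seed=0):
    """Band-limited periodic signal and a copy shifted by d samples."""
    rng = np.random.default_rng(seed)
    amp = rng.uniform(0.5, 1.5, n_harmonics)
    phi = rng.uniform(0.0, 2.0 * np.pi, n_harmonics)
    k = np.arange(1, n_harmonics + 1)
    t = np.arange(n)

    def signal(shift):
        arg = 2.0 * np.pi * np.outer(t + shift, k) / n + phi
        return (amp * np.cos(arg)).sum(axis=1)

    return signal(0.0), signal(d)

y1, y2 = make_pair()
d_hat = phase_slope_delay(y1, y2)
print(d_hat)
```

Because the test signal contains no energy above harmonic 20, the empty high-frequency bins receive negligible weight; with real, aliased data those bins would bias the estimate, which is the problem the rest of this section addresses.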

Unfortunately, the presence of edges in images
(e.g. rivers, roads, fractures, cell or
property boundaries) means that frequencies
higher than half the sampling rate are
often present, in which case the data are
aliased. This manifests itself in biases in
$\hat\theta(\omega_u)$ for various $\omega_u$. If there is no aliasing
and $|D| < 1$, then in large samples,
$\hat\Delta(\omega_u) = \hat\theta(\omega_u)/\omega_u$ (the phase delay) $\approx D$.
As an experiment, we generated 512 pairs
of time series from real data satisfying (2.1)
and a discrete approximation to (2.2). For
each pair, $N = 101$ and $D = 0.2$. Further
details can be found in Berman et al (1992,
Section 3). Fig. 1 shows the means (plus
and minus one standard deviation) of the
phase delays for the 512 data sets at
the $[N/2] = 50$ positive frequencies given after
(2.3). Biases are clearly present at high
frequencies and at very low frequencies. The
latter are due to the fact that the data are
not periodic at the boundaries, and can be
largely corrected by "tapering" (Bloomfield,
1976, Section 5.2); see Berman et al (1992,
Section 3) for details.

The high-frequency biases are due to aliasing in
the data. It can be shown that, under mild
regularity conditions, $\hat\theta(\omega)$ will converge, as
$N \to \infty$, to $\theta(\omega)$, the phase of the function

$$E(\omega) = (2\pi)^{-1}\sum_{t=-\infty}^{\infty}\gamma_s(t + D)e^{-it\omega} \qquad (2.7)$$

$$= \sum_{l=-\infty}^{\infty} f_s(\omega + 2\pi l)e^{iD(\omega + 2\pi l)}, \qquad (2.8)$$

where

$$f_s(\omega) = (2\pi)^{-1}\int_{-\infty}^{\infty}\gamma_s(t)e^{-it\omega}\,dt \qquad (2.9)$$

denotes the spectrum of the signal. Depending
on the nature of $\gamma_s(t)$ and $f_s(\omega)$, it will
sometimes be convenient to use (2.7) to compute
$\theta(\omega)$ and sometimes (2.8). The "unbiasedness"
of the cases (a) $D = K/2$ and
(b) no aliasing of the data (i.e. $f_s(\omega) = 0$,
$|\omega| > \pi$) follows easily from (2.8). We
have computed (2.7) or (2.8) for a range of
values of $D \in (-1, 1)$ and for a variety of
autocorrelation functions. Typically, the resulting
phase delays asymptote to $D$ as $\omega \to 0$ and converge to
0 as $\omega \to \pi$. A theoretical explanation for
this phenomenon is given in a Proposition in
Berman et al (1992, Section 4). As an example,
Fig. 2 shows the phase delay when
$D = 0.2$, $\mathrm{cov}(X(u), X(u + t)) = \rho^{|t|}$, and $\rho$
= .99 (solid line), .5 (dots) and .1 (dashes).
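
The behaviour just described can be explored numerically by evaluating the phase of a truncated sum $\sum_t \gamma_s(t + D)e^{-it\omega}$ and dividing by $\omega$. The sketch below is illustrative only and is not a reproduction of Fig. 2: it assumes that the exponential autocovariance $\rho^{|t|}$ applies directly to the sampled signal (it ignores the pixel integration in (2.2)), the truncation point `t_max` is an arbitrary choice, and the function name is hypothetical.

```python
import numpy as np

def limiting_phase_delay(omega, d, rho, t_max=3000):
    """Phase of sum_t gamma(t + d) * exp(-i t omega), divided by omega,
    under the assumed autocovariance gamma(t) = rho ** |t|."""
    t = np.arange(-t_max, t_max + 1)
    gamma = rho ** np.abs(t + d)
    e = np.sum(gamma * np.exp(-1j * t * omega))
    return np.angle(e) / omega

d, rho = 0.2, 0.99
low = limiting_phase_delay(0.01, d, rho)     # near zero frequency
mid = limiting_phase_delay(2.0, d, rho)      # intermediate frequency
high = limiting_phase_delay(np.pi, d, rho)   # at the folding frequency
print(low, mid, high)
```

Under these assumptions the phase delay is close to $D$ at low frequency and falls towards zero at $\omega = \pi$, in line with the qualitative pattern described in the text.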


Fig. 1

Note how in Fig. 2 (and in Fig. 1 if we taper
appropriately) the estimates of the phase delay
are, for practical purposes, unbiased below
a cutoff frequency (which will be application
dependent). When $|D| < 1$, a practical
estimate of $D$ can be obtained by taking a
weighted mean of the phase delay estimates
at frequencies less than the cutoff frequency
(assuming we can obtain a good estimate of
it), where the weights are inversely proportional
to the error variances of the
corresponding phase delay estimates.


Fig. 2 (horizontal axis: frequency in radians)


However, we have found that, when $|D| < 1$,
a very good empirical approximation to
many phase delay curves is given by the formula

$$\Delta(\omega) = D - A(e^{B\omega} - 1). \qquad (2.10)$$

Typically $B > 0$. We estimate the parameters
by non-linear weighted least squares,
where the weights are again inversely proportional
to the error variances; see Berman
et al (1992, Section 5) for further details.
We  can  interpret  this  as  an  approximate 


means  of  finding  the  cutoff  frequency.  Fur¬ 
ther,  since  we  are  interested  in  estimating 
D,  precise  estimation  of  A  and  B  is  not  im¬ 
portant. 
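
A curve of the form $\Delta(\omega) = D - A(e^{B\omega} - 1)$ is linear in $(D, A)$ once $B$ is fixed, so one simple way to fit it is a grid search over $B$ with a weighted linear least squares solve at each grid point. This is a sketch of that separable approach, swapped in for a general non-linear least squares routine; it is not the authors' fitting procedure. The synthetic parameter values, the constant variances and the grid for $B$ are all hypothetical.

```python
import numpy as np

def fit_phase_delay_curve(omega, delay, var, b_grid):
    """Weighted least squares fit of delay = D - A * (exp(B * omega) - 1).
    For each trial B the model is linear in (D, A)."""
    sw = 1.0 / np.sqrt(var)                  # square roots of the weights 1/var
    best = None
    for b in b_grid:
        design = np.column_stack(
            [np.ones_like(omega), -(np.exp(b * omega) - 1.0)]
        )
        coef, _, _, _ = np.linalg.lstsq(
            design * sw[:, None], delay * sw, rcond=None
        )
        rss = np.sum((sw * (delay - design @ coef)) ** 2)
        if best is None or rss < best[0]:
            best = (rss, coef[0], coef[1], b)
    _, d_hat, a_hat, b_hat = best
    return d_hat, a_hat, b_hat

# Synthetic, noise-free phase delays at the 50 positive frequencies for N = 101.
omega = 2.0 * np.pi * np.arange(1, 51) / 101.0
true_d, true_a, true_b = 0.2, 0.002, 1.5
delay = true_d - true_a * (np.exp(true_b * omega) - 1.0)
var = np.full_like(omega, 1e-4)              # assumed error variances

d_hat, a_hat, b_hat = fit_phase_delay_curve(
    omega, delay, var, np.linspace(0.1, 5.0, 50)
)
print(d_hat, a_hat, b_hat)
```

Since only $D$ is of interest, a coarse grid for $B$ is often adequate in a sketch like this; a finer grid or a proper non-linear optimiser would be needed if $A$ and $B$ themselves mattered.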

When $|D| > 1$, we have found it best to estimate
the integer part of $D$ first, using a
method such as the maximisation of (2.4),
then shift one time series the relevant number
of integer units with respect to the other,
remove the non-overlapping parts of the resulting
data sets, and finally estimate the
fractional part using (2.10) in conjunction
with weighted least squares estimation.


3  Two-Dimensional  Images 

Much  of  the  one-dimensional  theory  above 
is  readily  extendible  to  two  dimensions,  and 
hence  applicable  to  the  misregistration  prob¬ 
lem.  Let  Dx  and  Dy  denote  the  shifts  in  the 
X  and  y  directions  respectively.  Equations 
(2.1)  and  (2.2)  extend  in  an  obvious  way. 
Again,  with  the  aid  of  simulated  data,  we 
have  sometimes  been  able  to  demonstrate 
biases  in  cross-covariance-based  methods  of 
about  0.1  of  a  pixel.  The  Fourier  theory  also 
extends easily. For $M \times N$ images, the two-dimensional
Fourier transform of image $j$ is

$$F_j(\omega_u, \lambda_v) \propto \sum_{s=1}^{M}\sum_{t=1}^{N} Y_j(s, t)e^{i(s\omega_u + t\lambda_v)} \qquad (3.1)$$

($\omega_u = 2\pi u/M$, $\lambda_v = 2\pi v/N$, $u = -M/2, \ldots, M/2 - 1$,
$v = -N/2, \ldots, N/2 - 1$). When
aliasing is absent, the phase difference in
large samples satisfies

$$\hat\theta(\omega_u, \lambda_v) \approx \omega_u D_x + \lambda_v D_y + 2\pi m(\omega_u, \lambda_v), \qquad (3.2)$$

where $m(\omega_u, \lambda_v)$ is that integer chosen to
ensure that $\hat\theta(\omega_u, \lambda_v) \in (-\pi, \pi]$. If
$|D_x| + |D_y| < 1$, $m(\omega_u, \lambda_v) = 0$. For the time being,
we shall assume this to be so. When
aliasing is present, there are two options. In
the first option, we can find a rectangular
region around the origin for which (3.2) is
a good approximation. This involves finding
cutoff frequencies in both the $x$ and $y$
directions. Let

$$\hat\Delta_x(\omega_u, \lambda_v) = \{\hat\theta(\omega_u, -\lambda_v) + \hat\theta(\omega_u, \lambda_v)\}/2\omega_u, \qquad (3.3)$$

$u = 1, \ldots, M/2 - 1$, $v = 1, \ldots, N/2 - 1$.
For those frequencies for which (3.2) is a
good approximation, it is easily seen that, in
large samples, $\hat\Delta_x(\omega_u, \lambda_v) \approx D_x$. A suitably
weighted mean of the $\hat\Delta_x(\omega_u, \lambda_v)$'s gives an
appropriate estimator of $D_x$. An analogous
procedure holds for estimation of $D_y$. Under
mild assumptions, the two estimators are
approximately uncorrelated. See Berman et
al (1990, Section 3) for further details.
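
The first option can be sketched as follows: average $\{\hat\theta(\omega_u, -\lambda_v) + \hat\theta(\omega_u, \lambda_v)\}/2\omega_u$ over a low-frequency rectangle to estimate $D_x$, and use the analogous difference to estimate $D_y$. The sketch uses a noise-free, band-limited, periodic test image, NumPy's FFT sign convention and an unweighted mean in place of a "suitably weighted" one; all of these, together with the function names and the cutoff `max_index`, are assumptions made for illustration.

```python
import numpy as np

def shift_estimates(img1, img2, max_index):
    """Estimate (Dx, Dy), where img2(s, t) is approximately
    img1(s + Dx, t + Dy), from low-frequency phase differences."""
    m, n = img1.shape
    f1 = np.fft.fft2(img1)
    f2 = np.fft.fft2(img2)
    theta = np.angle(f2 * np.conj(f1))       # phase differences in (-pi, pi]
    dx, dy = [], []
    for u in range(1, max_index + 1):
        for v in range(1, max_index + 1):
            wu = 2.0 * np.pi * u / m
            lv = 2.0 * np.pi * v / n
            plus, minus = theta[u, v], theta[u, -v]
            dx.append((minus + plus) / (2.0 * wu))
            dy.append((plus - minus) / (2.0 * lv))
    return np.mean(dx), np.mean(dy)          # unweighted means (assumption)

def make_images(m=32, n=32, dx=0.2, dy=0.4, k_max=4, seed=1):
    """Band-limited periodic test image and a shifted copy."""
    s, t = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")

    def image(shift_s, shift_t):
        rng = np.random.default_rng(seed)    # same components for both images
        out = np.zeros((m, n))
        for k in range(1, k_max + 1):
            for l in range(-k_max, k_max + 1):
                amp = rng.uniform(0.5, 1.5)
                phi = rng.uniform(0.0, 2.0 * np.pi)
                arg = 2.0 * np.pi * (k * (s + shift_s) / m + l * (t + shift_t) / n)
                out += amp * np.cos(arg + phi)
        return out

    return image(0.0, 0.0), image(dx, dy)

img1, img2 = make_images()
dx_hat, dy_hat = shift_estimates(img1, img2, max_index=4)
print(dx_hat, dy_hat)
```

Here `max_index` plays the role of the cutoff frequencies in the two directions; with real imagery it would have to be chosen so that aliased frequencies are excluded.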

A second, more appealing option, which we
now use routinely, is the following. First,
we assume separability of the autocovariance
function of the signal, i.e.
$\mathrm{cov}\{S(s, t), S(s + u, t + v)\} = \gamma_x(u)\gamma_y(v)$, where $S(s, t)$
denotes the signal at $(s, t)$. It follows easily
that the limiting phase difference, $\theta(\omega_u, \lambda_v)$,
will satisfy $\theta(\omega_u, \lambda_v) = \theta_x(\omega_u) + \theta_y(\lambda_v)$, and
hence that, in large samples,

$$\hat\Delta_x(\omega_u, \lambda_v) \approx \theta_x(\omega_u)/\omega_u. \qquad (3.4)$$

Note that the right-hand side of (3.4) is independent
of $\lambda_v$, and is also the phase delay
of a ONE-DIMENSIONAL time series,
which we have found is well modelled by
(2.10). Our solution therefore is to compute

$$\hat\Delta_x(\omega_u) = \sum_v \hat\Delta_x(\omega_u, \lambda_v)\sigma_{uv}^{-2} \Big/ \sum_v \sigma_{uv}^{-2}, \qquad (3.5)$$

where $\sigma_{uv}^2 = \mathrm{Var}\{\hat\Delta_x(\omega_u, \lambda_v)\}$. Then
$\mathrm{Var}(\hat\Delta_x(\omega_u)) = (\sum_v \sigma_{uv}^{-2})^{-1}$. Typically,
$\sigma_{uv}^2$ needs to be estimated via the residuals from
some local smoothing procedure. Finally,
we fit a model of the form (2.10) applied to
$\hat\Delta_x(\omega_u)$ by non-linear weighted least squares,
where the weights are proportional to the
inverse of $\mathrm{Var}(\hat\Delta_x(\omega_u))$. If the various assumptions
underlying our model are correct,
the residual variance from this fit should be
about 1. We should stress however that
the assumption of separability of the autocovariance
function is not critical to the
success of this method. It can be interpreted
as an indirect method of finding two-dimensional
cutoff frequencies. When $|D_x| + |D_y| > 1$,
we can estimate $D_x$ and $D_y$ to the
nearest integer, using a two-dimensional version
of the Hamon-Hannan procedure or by
finding where the cross-covariance is maximised,
shifting the images the appropriate
number of pixels, trimming them and using
the above procedure to estimate the fractional
parts.

We have applied this method to a simulated
image pair, each of size 200 x 200, in which
there is no noise and for which $D_x = 0.2$
and $D_y = 0.4$. Details of the construction
of these images can be found in Berman et
al (1990, Section 3). Our estimates (and
their standard errors) are $\hat D_x = .198$ (.005),
$\hat D_y = .398$ (.007). We have also applied the
method to a number of real remotely sensed
image pairs, and in most cases obtained comparable
results. One example can be found
in Berman et al (1990); others will be published
elsewhere. In some cases, however,
the method breaks down. This occurs when
the two images are not highly correlated (in
our experience, when the maximum cross-correlation
between the two images is less
than about 0.7). For remotely sensed imagery,
this is typically because the wavelengths
at which the two images are recorded
are sufficiently far apart that the signals are
no longer linearly related and so the two-dimensional
version of (2.1) no longer holds.
Consequently, care in the use of the method
described here is required.

References

Hamon, B.V. and Hannan, E.J. (1974) Spectral estimation of time delay for dispersive and non-dispersive systems. Appl. Statist., 23, 134-142.

Hannan, E.J. (1975) Measuring the velocity of a signal. In Perspectives in Probability and Statistics (ed. J.M. Gani). Sheffield: Applied Probability Trust.

Hannan, E.J. and Thomson, P.J. (1988) Time delay estimation. J. Time Series Analysis, 9, 21-33.

Abstract: Magnetic resonance imaging (MRI) is currently
the most sensitive modality for detecting and differentiating
pathophysiologic events. Transverse relaxation
times (T2) provide quantitative information useful
for evaluating a number of diseases (Dumitresco et al,
(1986)). In MRI the observed T2 signal is modeled by
$m(t) = \lambda\sum_{j=1}^{k}\delta_j e^{-\alpha_j t}$, where the reciprocal of $\alpha_j$ is
the corresponding expected relaxation time. We consider
maximum likelihood estimation of the parameters
$\lambda$, $\delta_j$, $\alpha_j$, $j = 1, \ldots, k$, under the assumption that the
number of excited protons measured follows a Poisson
distribution. A computationally simple method for selecting
$k$, the number of exponential components in the
model, is proposed.

1.  INTRODUCTION. 

Estimation  of  the  individual  parameters  in  the  sum 
of  exponential  components  is  a  long  standing  statistical 
problem  (Niedzwiecki  and  Simonoff,  (1990))  for  which  a 
solution  is  of  fundamental  interest  in  the  field  of  MRI. 
The  multicomponent  exponential  model  associated  with 
the  T2  curve  is  derived  in  Section  2.  Section  3  pro¬ 
vides  an  overview  of  the  fundamental  estimation  prob¬ 
lems  encountered  with  this  model.  Clearly,  estimation 
of  the  model  parameters  is  facilitated  by  knowledge  of 
the  correct  model.  In  Section  4  standard  model  selection 
procedures  are  examined  and  a  new  method  for  model 
selection  is  proposed.  Finally,  in  Section  5  the  methods 
developed  are  applied  to  MRI  data  from  an  in  vivo  study 
of  female  breast  tissue. 

2.  THE  MODEL. 

It is reasonable to assume the initial number of magnetized
hydrogen molecules, $A(0)$, in the multi-compartment
system follows a Poisson distribution with parameter
$\lambda$. Further, we assume that the relaxation time of a
molecule follows an exponential distribution with parameter
$\alpha_j$ for $j = 1, \ldots, k$. Let $X_j(t)$ denote the number of
excited molecules at time $t$ of compartment $j$. Then the
conditional joint distribution of $X_1(t), \ldots, X_k(t)$ given
$A(0)$ is multinomial with parameters $A(0), p_1(t), \ldots, p_k(t)$,
where $p_j(t) = \delta_j\exp\{-\alpha_j t\}$ and $\sum_{j=1}^{k}\delta_j = 1$. In MRI,
the random variable of interest is $Y(t) = a\sum_{j=1}^{k}X_j(t)$,
the scaled signal, where $a$ is a real-valued constant. We
assume $a = 1$ for the purpose of this paper (consistent
with cited authors); however, this parameter deserves
future investigation. It then follows that the marginal
distribution of $Y(t)$ is Poisson with mean function

$$m(t) = \lambda\sum_{j=1}^{k}\delta_j e^{-\alpha_j t}.$$

3. ESTIMATION OF T2 RELAXATION TIMES.

Given independent observations of $Y(t)$ at times
$t_1, \ldots, t_n$, we focus on estimation of the parameters $\lambda$, $\delta_j$,
$\alpha_j$ for $j = 1, \ldots, k$, with $\alpha_1 > \ldots > \alpha_k$ and $\sum_j \delta_j = 1$.
The expected T2 relaxation times are given by $1/\alpha_j$,
$j = 1, \ldots, k$, and are the primary parameters of interest.
The maximum likelihood estimates can be obtained by
iteratively reweighted least squares with weight function
$1/m(t)$ (see Frome, Kutner and Beauchamp (1973), del
Pino (1989), Green (1984)).

Sandor et al (1988) derive the m.l.e. for a slightly
different model of the decay curve of the transverse relaxation.
They assume the observations are from a Poisson
random variable with mean function given by $\int_I m(t)\,dt$,
where $I = (t_{i-1}, t_{i+1})$ denotes the time interval. Their
formulation assumes the observations represent an accumulated
response. Unfortunately the investigation of
this model was limited to equally spaced time intervals
and cannot be distinguished from a model based on time
specific signal intensity. For the unequally spaced data
in our example the accumulated model provides a very
poor fit.

Although theoretically the above estimates are obtainable,
realistically, solving the maximum likelihood
equations is very difficult. The problems with fitting the
sum of exponential components are well documented in
Bates and Watts (1988) and Seber and Wild (1989). One
major problem is that of parameter redundancy; in other
words, models of different order produce similar results.
Based on Reich's (1979) measure for parameter redundancy
one cannot reliably estimate the parameters of a
biexponential model if the ratio of the decay rates is less
than .2. His measure, however, was developed for an additive
error model with equally spaced observations. In
MRI one expects the mean times of the long and short
components to differ by less than a factor of five, so
Reich's measure would imply that estimation of these
means is futile. Sandor et al (1988) show that for the
Poisson model one can reliably estimate both the short
and long expected times when the ratio of the two is as
low as 1.1.

4.  HOW  MANY  COMPARTMENTS? 


Our objective in this paper is to introduce a noninteractive
method for identifying the number of compartments
which should be included in the model (again we
examine either one or two compartment models). To this
end, we have considered two approaches: use of standard
model selection procedures for additive error models and
development of a graphical method for model selection.

4.1 Standard Approaches for Model Selection.

Assume $Y(t)$ is modeled by $Y(t_i) = m(t_i) + \epsilon_i$, where
$\epsilon_i$ for $i = 1, \ldots, n$ are independent Normal random variables
with zero mean and variance $\sigma_i^2$. Hurvich and Tsai
(1989) propose a corrected form of Akaike's Information
Criterion (AIC) (Akaike, 1973) for purposes of model
selection in both linear and nonlinear regression when
dealing with small sample sizes. For nonlinear regression
with nonconstant variance their criterion is

$$\mathrm{AIC}_c = n\ln\hat\sigma^2 + n\,\frac{1 + m/n}{1 - (m + 2)/n},$$

where $\hat\sigma^2$ is the maximum likelihood estimate of $\sigma^2$ and
$m$ is the number of parameters in the model. The model
selected is the model which minimizes $\mathrm{AIC}_c$. For 16 observations
(which is the number of observations in our
examples) the ratio $\hat\sigma_1^2/\hat\sigma_2^2$ of the variance estimates from
the one- and two-compartment fits must exceed 1.40 to select
a two-compartment model based on this criterion. For the
uncorrected AIC a two-compartment model is indicated if
this ratio exceeds 1.28.

The results of simulations based on both model selection
procedures are given in Table 1.

Table 1. Model Selection for Additive Error Model

                                           % Correct
m(t)                                       AIC    AICc
$300(.5e^{-t/30} + .5e^{-t/130})$           95     95
$300(.5e^{-t/20} + .5e^{-t/150})$           92     88
$500(.8e^{-t/30} + .2e^{-t/130})$          100    100
$450\,e^{-t/[\ldots]}$                      80     88

The above percentages are out of 25 replications.

In addition to the dependence on an additive error
model, a major drawback to the AIC or corrected
AIC selection criterion is that both the monoexponential
and biexponential models must be fitted to the data.
Frequently when the incorrect model is fitted, achieving
convergence of the optimization routines is difficult, thus
these methods are undesirable. In the next section we
propose a method of model selection which does not require
fitting the models.


4.2  Suggested  Approach  for  Model  Selection. 

Consider the biexponential model observed at times
ranging from 20 to 300 units. The signal attributed to
the "short" component will decay more rapidly than the
signal of the "long" component; therefore, the long component
should dominate the signal at the larger times.
This is the fundamental argument given for obtaining
estimates of the expected T2 times by the method of
peeling (see Bates and Watts (1988)). If the true mean
function $m(t)$ contains more than one exponential term
but a monoexponential model is fitted to the function,
the monoexponential decay rate is a monotonic decreasing
function of time. More specifically, setting

$$\lambda e^{-\beta_0 t} = \lambda\sum_{j=1}^{k}\delta_j e^{-\alpha_j t}$$

and solving for $\beta_0$ we obtain

$$\beta_0(t) = -\frac{1}{t}\ln\Big(\sum_{j=1}^{k}\delta_j e^{-\alpha_j t}\Big).$$

Figure 1 is a plot of $\beta_0(t)$ for expected T2 times of
30 and 125 with three different mixtures (20% long and
80% short; 50% long and 50% short; 80% long and 20%
short); Figure 2 presents $\beta_0(t)$ with expected T2 times
of 30 and 65 for the same mixtures. For MRI data we
would be interested in the $\beta_0(t)$ curve up to time 300.
Clearly, when the short component contributes at least
50% of the signal this curve is distinguishable from the
constant curve exhibited by a one compartment model.
As expected, when the long component dominates the
signal it would be difficult to make a distinction between
a one and two compartment model. If we observe these
patterns in sample data, then we should be able to distinguish
between one and two compartment models.

Furthermore, the authors wish to note that even
when the signal is dominated by the long component in
a two compartment model and there exists reasonable
separation of the two expected relaxation times, care
should be taken in using methodologies which attribute
the tail data to the long component. As shown in Figure 1,
the curve for the mixture with 80% of the signal at a
decay rate of .008 appears flat by time 250, but the value
of $\beta_0(t)$ at this point is .00889 ($\beta_0(400) = .00856$). This
11% increase over .008 could understandably cause bias in
estimates based on the monoexponential model at the larger
times, and this is a model
which is dominated by the long component.
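
The two values quoted above can be checked directly if the monoexponential rate is taken to be $\beta_0(t) = -\ln(\sum_j \delta_j e^{-\alpha_j t})/t$, which is what results from equating $\lambda e^{-\beta_0 t}$ with the multi-exponential mean. The following sketch evaluates it for the 20% short / 80% long mixture with expected T2 times of 30 and 125; the function and variable names are hypothetical.

```python
import math

def beta0(t, fractions, rates):
    """Monoexponential decay rate that matches a multi-exponential
    mean function (with the same scale) at time t."""
    mixture = sum(d * math.exp(-a * t) for d, a in zip(fractions, rates))
    return -math.log(mixture) / t

fractions = (0.2, 0.8)                  # 20% short, 80% long
rates = (1.0 / 30.0, 1.0 / 125.0)       # decay rates for T2 times 30 and 125

print(round(beta0(250.0, fractions, rates), 5))   # 0.00889
print(round(beta0(400.0, fractions, rates), 5))   # 0.00856
```

Even at time 400 the rate is still about 7% above the long-component rate of .008, which illustrates why attributing the tail entirely to the long component can bias the estimate.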




As a computationally simple estimate of $\beta_0(t)$ we
propose using the slope from a simple linear model fitted
to $q + 1$ consecutive observations $z_i, \ldots, z_{i+q}$, where $z_i =
\ln Y(t_i)$ for $i = 1, \ldots, n$. In other words, as an estimate
of $\beta_0(t)$ at $t = \bar t_i$ we suggest

$$\hat\beta_0(\bar t_i) = -\frac{\sum_{l=i}^{i+q}(t_l - \bar t_i)(z_l - \bar z_i)}{\sum_{l=i}^{i+q}(t_l - \bar t_i)^2},$$

where $\bar t_i = (q + 1)^{-1}\sum_{l=i}^{i+q} t_l$ and $\bar z_i = (q + 1)^{-1}\sum_{l=i}^{i+q} z_l$.

The choice of $q$ will depend on the variation expected
in the data. For problems with large variation we
do not want an estimate of $\beta_0(t)$ to be based on just two
or three observations; however, if $q$ is chosen too large
then we are unable to identify the change in the decay
rate (e.g. consider the extreme case when $q = n - 1$).

In Figure 3, the above estimates of $\beta_0(t)$ for simulated
Poisson data from a two compartment model with
mean $m(t) = 500(.8e^{-t/30} + .2e^{-t/130})$ and a one compartment
model with mean $m(t) = 1000\,e^{-t/[\ldots]}$ are plotted
over $\bar t_i$. For these examples the slope is computed
from 6 consecutive points (i.e. $q = 5$). Clearly, the
two  compartment  model  is  distinguished  from  the  one 
compartment  model.  Future  investigation  into  the  use 
of  tests  of  randomness  to  distinguish  one  and  two  com¬ 
partment  models  is  warranted. 
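
The windowed slope estimate can be sketched in a few lines: fit a straight line to each run of $q + 1$ consecutive log observations and take the negative of its slope. To keep the result deterministic the sketch applies the estimator to noise-free mean curves rather than to simulated Poisson counts. The 16 equally spaced sampling times and the one-compartment time constant of 50 are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np

def windowed_decay_rates(t, y, q):
    """Negative least squares slope of log(y) against t over each window
    of q + 1 consecutive observations, reported at the window mean time."""
    z = np.log(y)
    centers, rates = [], []
    for i in range(len(t) - q):
        tw = t[i:i + q + 1]
        zw = z[i:i + q + 1]
        slope = np.polyfit(tw, zw, 1)[0]    # fitted slope of the line
        centers.append(tw.mean())
        rates.append(-slope)
    return np.array(centers), np.array(rates)

t = np.linspace(20.0, 300.0, 16)            # assumed sampling times
two = 500.0 * (0.8 * np.exp(-t / 30.0) + 0.2 * np.exp(-t / 130.0))
one = 1000.0 * np.exp(-t / 50.0)            # hypothetical time constant of 50

centers, rates_two = windowed_decay_rates(t, two, q=5)
_, rates_one = windowed_decay_rates(t, one, q=5)
print(np.round(rates_two, 5))
print(np.round(rates_one, 5))
```

For the one compartment curve every window returns the same rate, while for the two compartment curve the estimates fall steadily with time; with Poisson noise added, the choice of $q$ governs how clearly that trend shows through.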

An added advantage of our procedure is that the
reciprocal of the last estimate of $\beta_0(t)$ can be used as
an initial estimate of the long expected T2 time for input
into the optimization routine used to solve for the maximum
likelihood estimates. The final slope estimate corresponds
to the estimate of the long component obtained by the method
of peeling when the same number of observations are
used. As previously stated, this initial guess will be an
underestimate of the long expected time since the short
component still contributes substantially to the decay
rate at time 200 (see Figures 1 and 2). In our example,
the estimate of $\beta_0(t)$ at $\bar t_i = 187.5$ is .008666, which
would provide an initial guess of 115.39 for the long expected
time of 130.


5.  EXAMPLE. 


The  above  methods  were  applied  to  MRI  data  from 
an  in  vivo  study  of  the  female  breast.  Our  suggested 
method  of  order  selection  indicates  that  the  signal  from 
the  ductal  regions  of  the  breast  is  best  modeled  by  the 
monoexponential  decay  curve,  whereas  the  data  from  the 
lipid  region  suggests  a  two  compartment  model.  Figure 
4  presents  data  observed  at  two  sites  in  the  lipid  region 
of  one  patient  and  the  estimated  biexponential  decay 
curves.

Introduction

Computation of relative risk of mortality or
morbidity associated with exposure to some causative
agent proceeds under the implied assumption that the
exposure is measured with precision. In reality,
accurate exposure information may or may not be
obtainable. For instance, many surveys that
determine the smoking status of individuals in a
population appear to do so with acceptable accuracy.
However, occupational exposure to many toxic
substances may be much more difficult to ascertain
with similar accuracy. It is simply not practical to
measure the actual exposure directly on a broad scale
or on a continuous basis. Consequently, some form
of indirect or surrogate measurement must be
employed.

A number of techniques have been developed to
assign exposure levels when physical exposure
measurements are absent [Checkoway, 1986; Dement
et al, 1983]. For the most part these techniques rely
on estimated exposure levels related to job categories,
to location of workers, and/or to extrapolation from
present day measurements of exposure to previous
practices of manufacture. Despite the great amount
of effort devoted to devising and improving these
techniques, the problem remains that inevitably a
proportion of individuals classified as exposed by the
surrogate actually have little or no real exposure and
at the same time a proportion of those classified as
not exposed nevertheless have substantial exposure
[Gerin et al, 1983, 1985; Siemiatycki et al, 1981; Hoar
et al, 1980; Dosemeci et al, 1990; Esmen, 1979;
Greenberg et al, 1981, 1983].

Only a few studies attempt to estimate the
proportion of individuals misclassified. Williams et al
[1984] observe that assignment by job title of
"Undertaker" to a category of "Exposed to
Formaldehyde" would result in approximately 25%
misclassification. Schulz et al [1983] report that 30%
of individuals who were classified according to
seriousness of complications had been misclassified.
Millar [1986] reports approximately 12%
misclassification of individuals who were classified by
self-reported heights and weights. Millar's observation
seems especially relevant because certainly it is likely
that individuals would know their heights and weights
with greater accuracy than their possible exposure to
toxic products.

Even where physical exposure data exist,
classification may be inaccurate because such data are
associated with a location, not an individual, and
workers typically move about their work area (Berode
et al, 1980a, 1980b; Boillat et al, 1986; Cope et al,
197?; Sterling, 1964).

The existence of these misclassifications and the
fact that commonly used methods of risk analysis
essentially treat the data as if there were no
misclassification raise two questions:

1. To what extent does the imprecise classification of
individuals affect the calculated relative risk (or
Apparent Relative Risk)?

2.  Is  it  possible  to  determine  from  the  Apparent 
Relative  Risk  what  the  True  Relative  Risk  might 
be  under  reasonable  estimates  of  the  amounts  of 
misclassification? 

Copeland et al [1977] and Goldberg [1975] present
numerical examples of the effects of misclassification.
Barron [1977] gives a formula for the true relative
risk in terms of conditional probabilities which are
related to the prevalence, sensitivity, and specificity
for the population at risk and the decedents. Flegal
et al [1986] give a formula for the true relative risk
in terms of the apparent relative risk and the
prevalence, sensitivity, and specificity.

Approaches  to  the  analysis  of  misclassification  bias 
generally  parameterize  the  problem  in  terms  of  the 
exposure  prevalence  and  the  sensitivity  and  specificity 
of the exposure classification. The bias resulting
when a confounding variable is present is discussed in
Greenland [1980] and in Greenland and Robbins
[1985] primarily through numerical examples. A good
overview of the effects of misclassification is given in
Kelsey et al [1986].

The problem with the traditional parameterization
is that the sensitivity, specificity, and proportion
exposed are all unknown and unobservable.
Estimating the sensitivity and specificity essentially
involves expressing a judgement about the proportion
of an unknown subpopulation (those truly exposed or
not exposed) which has been correctly classified by
the surrogate exposure variable. Instead we adopt an
approach similar to that of Green [1983] and
parameterize our analysis in terms of the proportion
classified as belonging to the higher likelihood of
exposure group (an observable quantity) and the
proportions of the higher and lower likelihood of
exposure groups which really have high or low
exposure respectively (the predictive values of the
positive and negative classification). While the
predictive values are conceptually similar to sensitivity
and specificity, as Green [1983] has pointed out, it
may be more feasible for an investigator to determine
or at least estimate bounds for these unknown
quantities.

In the remainder of the report we derive an
expression for the true relative risk in terms of the
apparent relative risk, the proportion in the high
exposure group, and the positive and negative
predictive values.

Analysis 

We start with the familiar method of computing
relative risk. Let b be the background probability of
occurrence of the disease or cause of death of interest
and let t be the true relative risk of exposed to
unexposed. Let p be the proportion of the
population at risk who are exposed. Then b(1-p)
and btp are the numbers of cases among the
unexposed and exposed respectively. (The absolute
size of the population at risk, N, always cancels out
when the relative risk is computed. Therefore we
omit it for simplicity although the tables are labeled
and formulas presented as if the factor N appeared in
each cell.) These data are normally arranged in a
two by two table as shown in Figure 1 and the
relative risk is given by the cross product ratio, which
is equal to t.

Next we consider the case in which the exposure
variable is imprecisely classified. Let b and t be
defined as before, and let p be the proportion of the
population at risk in the higher likelihood of exposure
group and l and h be the (unknown) proportions of
correct classification in the lower and higher groups
respectively. Now b(1-p)(l+(1-l)t) and bp(1-h+ht) are
the numbers of cases in the lower and higher groups
respectively. We arrange the data in a two by two
table as shown in Figure 2. Taking the cross product
ratio we obtain the apparent relative risk


$$a = \frac{(1 - h) + h t}{l + (1 - l) t}$$

Solving for t (the actual risk of exposure), we obtain

$$t = \frac{a l - (1 - h)}{h - a (1 - l)} = M(a, h, l)$$

where we define M(a, h, l) as the misclassification
function which gives the true relative risk that is
required to be consistent with given values of a, h,
and l. Under the assumption that the higher group
really is more likely to be exposed than the lower
group, 0 < (1-l) < h < 1. Consequently, 0 < (1-h)
< l < 1. Therefore, the numerator has a root at
(1-h)/l between 0 and 1, while the denominator has
a root at h/(1-l) which is greater than 1. Thus the
graph of t increases steadily from the point
((1-h)/l, 0), passes through the point (1,1), and approaches
a vertical asymptote at h/(1-l). Figure 3 shows a
graph of M(a,.6,.8), i.e., for the situation in which
40% of the high likelihood of exposure group and
20% of the low likelihood of exposure group have
been misclassified.
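
The misclassification function is simple to evaluate numerically. The following Python sketch is an illustration only: the function name `misclassification_function` and its argument checks are assumptions made here, not part of the original analysis. It evaluates M(a, h, l) and reports the root and the asymptote of the curve for h = .6 and l = .8.

```python
def misclassification_function(a, h, l):
    """True relative risk t consistent with an apparent relative risk a.

    h: proportion of the higher group that really has high exposure
    l: proportion of the lower group that really has low exposure
    Implements t = (a*l - (1 - h)) / (h - a*(1 - l)).
    """
    if not (0.0 < 1.0 - l < h < 1.0):
        raise ValueError("requires 0 < 1 - l < h < 1")
    denominator = h - a * (1.0 - l)
    if denominator <= 0.0:
        # a lies at or beyond the vertical asymptote h / (1 - l)
        raise ValueError("no finite true relative risk is consistent with a")
    return (a * l - (1.0 - h)) / denominator


h, l = 0.6, 0.8
root = (1.0 - h) / l        # apparent risk at which the curve crosses t = 0
asymptote = h / (1.0 - l)   # apparent risks at or above this cannot occur
print("root:", root, "asymptote:", asymptote)
for a in (0.5, 1.0, 2.0, 2.5):
    print(a, misclassification_function(a, h, l))
```

The explicit error for apparent risks at or beyond the asymptote is a design choice of this sketch; it makes the region in which no true relative risk exists visible to the caller rather than returning a negative value.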

Next  suppose  that  exposure  classification  is  precise 
but  there  is  a  dichotomous  confounding  factor.  Let  c 
be  the  relative  risk  of  the  confounder  and  s  the 
relative  synergistic  (or  antagonistic)  effect  between  the 
true  exposure  and  the  confounder.  In  other  words, 
persons  exposed  to  the  confounder  but  not  the  agent 
have  c  times  the  risk  of  persons  exposed  to  neither, 
while  persons  exposed  to  both  the  confounder  and 
the  agent  have  cs  times  the  risk  of  persons  exposed 
to  the  agent  alone. 

Let p be the proportion of the population at risk
who are exposed to the agent, u the proportion
exposed to the confounder, and f the proportion
exposed to both the agent and the confounder. Then
the proportion exposed to the agent alone is p-f, to the
confounder alone is u-f, and to neither is 1-p-u+f.
These data can be arranged in a table as shown in
Figure 4. The number of cases arising in each
subgroup is shown in Figure 5. We can compute
several forms of risk ratio standardized across levels
of the confounder. Let a denote the computed
standardized risk ratio (SRR).
Standardizing to the low exposure population group
gives


$$a = \frac{c s (f - u) + p + u - f - 1}{c (f - u) + p + u - f - 1} \, t$$

and solving for t yields

$$t = \frac{c (f - u) + p + u - f - 1}{c s (f - u) + p + u - f - 1} \, a = S_l \, a$$


Similarly,  standardizing  to  the  high  exposure 
population  group  yields 


$$t = \frac{c f - f + p}{c f s - f + p} \, a = S_h \, a$$


Finally,  standardizing  to  the  total  population  yields 


$$t = \frac{c u - u + 1}{c s u - u + 1} \, a = S_p \, a$$


In summary, the true relative risk of exposure, t, is
equal to the apparent relative risk, a, times a factor
which depends on c, s, f, u, and p. We call this
factor the synergism factor and denote it by S_x,
where x indicates the referent population. If there is no
synergistic effect, i.e., s = 1, then as is well known the
apparent relative risk is identical to the true relative
risk for all values of all the other parameters.
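
The three synergism factors can be computed directly from the subgroup proportions. The sketch below is a minimal illustration: the function name `synergism_factors`, the ordering of the returned values, and the numerical inputs are assumptions chosen here for demonstration. Each ratio is written with positive subgroup weights, which is algebraically the same as the expressions above with numerator and denominator multiplied by -1.

```python
def synergism_factors(c, s, p, u, f):
    """Return (S_low, S_high, S_total), where t = S_x * a for referent x.

    c: relative risk of the confounder
    s: relative synergistic (or antagonistic) effect
    p: proportion exposed to the agent
    u: proportion exposed to the confounder
    f: proportion exposed to both
    """
    w_neither = 1.0 - p - u + f    # exposed to neither agent nor confounder
    w_conf_only = u - f            # exposed to the confounder alone
    w_agent_only = p - f           # exposed to the agent alone
    w_both = f                     # exposed to both
    s_low = (w_neither + c * w_conf_only) / (w_neither + c * s * w_conf_only)
    s_high = (w_agent_only + c * w_both) / (w_agent_only + c * s * w_both)
    s_total = ((1.0 - u) + c * u) / ((1.0 - u) + c * s * u)
    return s_low, s_high, s_total


# Hypothetical inputs, for illustration only.
print(synergism_factors(c=2.0, s=1.5, p=0.3, u=0.4, f=0.1))
# With no synergism (s = 1) every factor equals 1, so t = a.
print(synergism_factors(c=2.0, s=1.0, p=0.3, u=0.4, f=0.1))
```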

Finally, we consider the combined effect of
misclassification and confounding. The distribution of
the  population  at  risk  is  the  same  as  in  the  previous 
case  (see  Figure  4),  except  that  Not  Exposed  and 
Exposed  should  now  read  Lower  and  Higher  and  p 
now  denotes  the  proportion  in  the  higher  likelihood 
of  exposure  group.  The  number  of  cases  arising  in 
each  subgroup  is  shown  in  Figure  6. 

Standardizing to the entire population and solving
for t yields t = S_p M(a, h, l). Similarly, standardizing
to the group with probable low exposure yields
t = S_l M(a, h, l), and standardizing to the group with
probable high exposure yields t = S_h M(a, h, l).

In a practical situation an estimate of a may be
obtained in the usual way and a test for heterogeneity
can be used to determine whether or not the
assumption that s = 1 is justified. If there is no
synergism, then the range of possible values of t can
be computed by assigning values or plausible ranges
of values to h and l. If there is synergism, then it is
not appropriate to attempt to summarize the effect of
the agent in terms of a single relative risk. Instead a
stratified analysis should be performed.

Discussion 

Let us turn to an example. A study compares two
groups of individuals, one of them classified as high
exposed and the other classified as low exposed (H
and L respectively). Assume that an apparent relative
risk of 1.8 is computed for the H group as compared
to the L group.

Figure 7 shows level curves of the function
M(1.8, h, l).

It is not unreasonable to assume that in a typical
occupational health study at least 10% and possibly as
many as 40% of individuals are incorrectly classified
[Williams et al, 1984; Fergusson et al, 1989; Schulz et
al, 1983; Millar, 1986]. Corresponding values of the
true relative risk can be read off for different
assumptions about misclassification. For instance,
under the assumption of 10% misclassification in each
category, an apparent relative risk of 1.8 would
correspond to a true relative risk of approximately
2.1. If it is assumed that as many as 30% of
individuals were misclassified in each group, a true
relative risk of 6.0 would be needed to result in an
apparent relative risk of 1.8.

The region in the lower left portion of Figure 7
may be considered the region of incompetence. It
corresponds to situations in which the L group would
be a better indicator of high exposure than the H
group: in other words, inept choice of surrogate
exposure variable. Such instances are unlikely to
occur so this region may be ignored.

It  is  important  to  note  that  for  certain 
combinations  of  misclassifications,  an  apparent  relative 
risk  of  1.8  (or  any  other  value)  could  not  be 
observed.  For  instance,  if  both  the  H  and  L  groups 
have  50%  misclassification,  the  apparent  relative  risk 
will  be  1  regardless  of  the  true  risk.  Therefore  a  low 
apparent  relative  risk  may  not  be  a  reflection  of 
absence of hazard but may simply be due to
imprecise  exposure  classification. 
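
The attenuation described here can be checked with the forward relation a = ((1-h) + ht) / (l + (1-l)t) derived in the Analysis section. The following sketch is illustrative; the function name `apparent_relative_risk` and the chosen values of t are assumptions. It shows that with 50% misclassification in both groups the apparent relative risk is 1 whatever the true risk, and that with 10% misclassification the apparent risk stays below h/(1-l) = 9 however large the true risk becomes.

```python
def apparent_relative_risk(t, h, l):
    """Apparent relative risk produced by a true relative risk t.

    h: proportion of the higher group that really has high exposure
    l: proportion of the lower group that really has low exposure
    """
    return ((1.0 - h) + h * t) / (l + (1.0 - l) * t)


for t in (1.0, 2.0, 5.0, 20.0):
    a_half = apparent_relative_risk(t, 0.5, 0.5)   # 50% misclassified in each group
    a_tenth = apparent_relative_risk(t, 0.9, 0.9)  # 10% misclassified in each group
    print(f"true {t:5.1f}  apparent (50%) {a_half:.3f}  apparent (10%) {a_tenth:.3f}")
```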

For practical purposes, the approach suggested
here may be used to set reasonable bounds between
which the true relative risk may be assumed to lie
given that an investigation has obtained a particular
apparent relative risk. The amount of
misclassification assumed to be operating may be set
either by what appears to be reasonable (i.e., between
10% and 30%) or by relevant existing or obtainable
information.
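
One way to obtain such bounds is to evaluate the misclassification function over a grid of assumed misclassification rates and take the smallest and largest values. The sketch below assumes this grid approach; the names `true_relative_risk` and `true_risk_bounds` and the 5% grid spacing are illustrative choices, not part of the original method. For an apparent relative risk of 1.8 and rates between 10% and 30% in each group it recovers the range discussed in the example, roughly 2.1 to 6.0.

```python
from itertools import product


def true_relative_risk(a, h, l):
    """M(a, h, l) = (a*l - (1 - h)) / (h - a*(1 - l))."""
    return (a * l - (1.0 - h)) / (h - a * (1.0 - l))


def true_risk_bounds(a, rates):
    """Smallest and largest true relative risk over all pairs of
    misclassification rates (high group, low group) taken from `rates`."""
    values = []
    for miss_high, miss_low in product(rates, repeat=2):
        h, l = 1.0 - miss_high, 1.0 - miss_low
        if h - a * (1.0 - l) <= 0.0:
            # The apparent risk a could not be observed for this pair.
            raise ValueError(f"a={a} is not attainable with h={h}, l={l}")
        values.append(true_relative_risk(a, h, l))
    return min(values), max(values)


rates = [0.10, 0.15, 0.20, 0.25, 0.30]
low, high = true_risk_bounds(1.8, rates)
print(f"true relative risk between {low:.2f} and {high:.2f}")
```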

Figure 3. Graph of M(a,.6,.8), the True Relative Risk required to
be consistent with an Apparent Relative Risk, a, if 60 percent of
the High Group actually have high exposure (h = .6, i.e., 40 percent
are misclassified) and 80 percent of the Low Group actually have
low exposure (l = .8, i.e., 20 percent are misclassified).

Introduction

Morton reported significant excess cancer
mortality rates among housewives compared to women
employed outside the home. Subsequently, Sterling
and Weinkam reported that age-specific morbidity
ratios for all chronic conditions were significantly
larger for housewives than for employed women.
These findings raise the question of whether the
incidence as well as prevalence of chronic diseases,
and in particular cancer, is larger in general among
housewives than among women employed in other
occupations. Such a difference would have important
consequences not only for housewives' chronic disease
and cancer morbidity but also for their cancer
mortality insofar as morbidity is related to mortality.
Verbrugge has shown that while morbidity rates
from chronic disease are higher for women than men,
mortality rates from chronic diseases are higher for
men than women. Among women, however, it seems
likely that groups that tend to have higher morbidity
rates for chronic diseases also may be expected to
have higher mortality rates from the same diseases.
The availability of large archives of population
morbidity data such as those collected by the U.S.
National Health Interview Survey (NHIS) makes it
possible to compare morbidity of homemakers to that
of women employed outside the home. Using public
use tapes of the NHIS, we compared morbidity of
homemakers with that of employed women for two
blocks of time: 1970-1975 and 1982-1987.

Method 

The  NHIS  collects  information  on  a  nationwide 
sample  of  households  as  part  of  the  ongoing  activity 
of  the  National  Center  for  Health  Statistics.  Each 
week  a  sample  of  households  is  selected  from  the 
civilian,  non-institutionalized,  U.S.  population  using  a 
stratified  probability  sampling  technique  in  such  a  way 
that  each  weekly  sample  is  representative  of  the 
target  population  and  the  weekly  samples  are  additive 
over time.

Public  use  tapes  of  the  NHIS  for  1970  to  1975 
and  1982  to  1987  were  used.  Individuals  were 
classified according to race (white, nonwhite), age (by
5-year  age  groups  for  ages  M  to  64)  and  occupation 
(employed  outside  the  home  and  homemaker). 
Homemakers are those who indicated that their usual
activity was "keeping house".

For purposes of making prevalence estimates of
chronic conditions, a list of diseases was read to each
NHIS sample member. The respondent reported his
or her experiences with each disease on the list.
During 1970 to 1977 one list per year was asked of
all sample members. Prevalence estimates for a
particular condition may be obtained only in the year
in which the condition was probed. For each year
1982 to 1987 each annual sample was divided into
one-sixth subsamples. All members of a particular
subsample were asked to respond to one of the six
lists. Using these data it is possible to compute
estimates of national prevalence rates for certain
chronic conditions. However, only in 1982 and 1983
did the NHIS probe for the existence of any
malignant neoplasm. Thus it is possible to obtain
national prevalence estimates for any form of cancer
based only on the 1982 and 1983 data (although for
selected sites it is possible to use data for 1982 to
1987).

National estimates of the total number of persons
and of prevalence rates for various causes were
computed for each race-age-employment group, for
each year and for all years combined. SPRs
(Standardized Prevalence Ratios) were computed in a
manner identical to the computation of the familiar
Standardized Mortality Ratio using the employed
population as the referent. Variances were computed
using the appropriate generalized variance function
recommended by NCHS and confidence intervals
computed under the assumption of a log normal
distribution for the SPR estimate.
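
The computation can be sketched as follows. This is a minimal illustration under stated assumptions: the SPR is formed by indirect standardization (observed cases in the study group divided by the cases expected if the referent group's stratum-specific rates applied), the standard error of log(SPR) is supplied by the caller rather than derived from the NCHS generalized variance function, and the function names and all stratum counts are hypothetical.

```python
import math


def standardized_prevalence_ratio(study_cases, study_pop, ref_rates):
    """Observed cases divided by cases expected under the referent rates.

    study_cases: cases in the study group, one entry per stratum
    study_pop:   persons in the study group, one entry per stratum
    ref_rates:   prevalence rates in the referent group, per stratum
    """
    observed = sum(study_cases)
    expected = sum(n * r for n, r in zip(study_pop, ref_rates))
    return observed / expected


def lognormal_interval(spr, se_log, z=1.96):
    """Confidence limits assuming log(SPR) is normally distributed.

    se_log is the standard error of log(SPR); it is an input here.
    """
    half_width = z * se_log
    return spr * math.exp(-half_width), spr * math.exp(half_width)


# Hypothetical strata (for example, three age groups).
spr = standardized_prevalence_ratio(
    study_cases=[30, 50, 80],
    study_pop=[1000, 1000, 1000],
    ref_rates=[0.02, 0.04, 0.07],
)
lower, upper = lognormal_interval(spr, se_log=0.10)
print(f"SPR {spr:.2f} ({lower:.2f}, {upper:.2f})")
```

Limits built this way are symmetric about the estimate on the log scale, so the geometric mean of the two limits equals the SPR. The published intervals show the same pattern: for hypertension in 1970-1975, the square root of 1.17 x 1.47 is about 1.31.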

All  analyses  were  done  on  weighted  data.  The 
National  Center  for  Health  Statistics  weighting  factors 
compensate  for  sampling  variation  within  different 
sampling  areas  and  adjust  the  data  to  the  race-age-sex 
distribution  of  the  non-institutionalized  U.S.  population 
as  determined  by  the  U.S.  Current  Population 
Survey.

Results 

Homemakers show an increased prevalence over
employed women of each chronic condition
investigated, with the lone exception of breast cancer.
Table 1 gives estimates of SPRs and corresponding
95% confidence intervals for chronic conditions for
1970 to 1975 and 1982 to 1987. Prevalence ratios
larger than 1.0 indicate increased condition prevalence
for homemakers compared with employed women.
Hypertension, ischemic heart disease, stroke and
combined bronchitis, emphysema and asthma (for
1982 to 1987 but not for 1970 to 1975) all exhibit
statistically significant increases in prevalence ratios.
Each cancer site considered (except for breast cancer) and all
cancers combined showed an increased prevalence
ratio among homemakers. However, that increase fell
short of the customary level for rejecting the null
hypothesis (p ≤ .05), possibly due to the relative
scarcity of data for these conditions. Because cancer
SPRs for all sites but one are elevated and because
the SPRs for 1970-1975 and 1982-1987 are very much
alike, the cancer risks may be considered as
significantly elevated as well. Our data then support
the conclusion that women working at home have a
significantly higher prevalence rate of all chronic
conditions when compared with women working
outside the home.

The excess prevalence of chronic conditions among
homemakers relative to employed women may be due
to the occupational exposures of homemaking or to a
number of confounding factors. Some women may
have to select homemaking because they suffer from
a chronic disease that prevents them from seeking
and holding down full time employment.

The selection bias for adopting housework as an
occupation because of already existing disease may be
controlled for to some extent by adjusting each
chronic disease for the difference between
homemakers and employed women in risk for all chronic
diseases. Such an adjustment may be simply done by
dividing the risk ratios for Cancer, Heart and Other
Chronic Diseases of Table 1 by the SPR for all
chronic conditions. The result of that adjustment can
be seen in Table 2. After this adjustment, the SPRs
are  still  elevated  for  all  conditions  except  for  breast 
cancer  and  hypertension  for  the  period  1970-1975  and 
for  breast  cancer  but  not  hypertension  for  the  period 
1982-1987.  Again  we  note  that  the  SPRs  were 
elevated  for  6  out  of  9  conditions  in  1970-1975  and  9 
out  of  11  conditions  in  1982-1987.  The  probability  of 
that  many  increased  SPRs  arising  from  the  population 
with  similar  prevalence  of  disease  ten  years  apart  by 
chance  alone  is  vanishingly  small. 
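
The adjustment is a single division per condition. The sketch below applies it to three 1982-1987 point estimates taken from Table 1, using the 1982-1987 SPR for any chronic condition (1.21) as the divisor; the function name `adjust_spr` is an assumption made for illustration. The rounded results agree with the corresponding 1982-1987 entries of Table 2.

```python
def adjust_spr(spr, spr_any_chronic):
    """Divide a condition-specific SPR by the SPR for any chronic condition."""
    return spr / spr_any_chronic


# 1982-1987 point estimates from Table 1.
spr_any_chronic_1982_1987 = 1.21
table1_1982_1987 = {
    "Hypertension": 1.44,
    "Breast Cancer": 0.95,
    "Lung Cancer": 3.41,
}

adjusted = {
    condition: round(adjust_spr(spr, spr_any_chronic_1982_1987), 2)
    for condition, spr in table1_1982_1987.items()
}
for condition, value in adjusted.items():
    print(f"{condition:15s} {value:.2f}")
# Hypertension    1.19
# Breast Cancer   0.79
# Lung Cancer     2.82
```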

Differences in prevalence rates are unlikely to be
due to smoking. While our analysis combines NHIS
data for the years 1970 to 1975 and 1982 to 1987,
smoking information was obtained only for 1970 and
1987. However, for both years similar percentages of
homemakers and otherwise employed women smoked
(see Table 3). Weinkam and Sterling have shown a
similar lack of difference in smoking prevalence
between homemakers and otherwise employed women
for the 1979-1980 NHIS.

Discussion 

Homemaking  or  housekeeping  has  not  been  and  is 
not  now  generally  considered  an  occupation. 
Historically,  the  keeping  of  the  house  and  the  care  of 
children  was  considered  woman’s  work,  but  not  in  the 
sense  of  an  occupation.  It  was  an  obligation  and 
duty,  performed  by  the  wife.  Even  in  our  advanced 
Western  societies  and  in  developing  countries 
housework  is  still  not  recognized  as  an  occupation 
entitled  to  some  of  the  basic  consideration  of 
employment  such  as  coverage  by  Social  Insurance 
(Social  Security  in  the  U.S.),  or  by  Workman’s 
Compensation,  or  by  pay  (or  even  by  a  recognized 
commercial  value). 

Yet,  housework  has  all  the  earmarks  of  an 
occupation.  It  is  performed  in  a  workplace,  the 
home.  It  requires  a  number  of  skills,  some  of  them 
rather  intricate.  The  obligations  of  the  worker  can 
be described (including cooking, cleaning, washing,
various types of yardwork, maintenance and repair,
use  of  appliances  etc.).  Whether  or  not  it  is  officially 
recognized,  housework  has  a  definite  commercial 
value.  The  cost  of  replacing  the  houseworking  spouse 
with  paid  help  may  be  considerable,  as  the  cost  of 
care  for  the  handicapped  or  the  elderly  proves. 

Like  all  occupations,  housework  has  its  hazards. 
These  hazards  may  be  divided  into  two  groups,  that 
of  occupationally  related  accidents,  and  that  of 
chronic  disease  following  exposure  to  toxic  materials. 

Homemaker’s  Potential  Exposure  to  Toxic  Substances 

Table 4 (expanding listings by Gleason et al) lists
toxic components commonly found in the home.
Perhaps the most serious exposure is to modern
household cleaners. They are favorite household tools
because they relieve the homemaker of considerable
physical exertion. However, they expose the user to
extremely toxic agents in parts of the dwelling that
usually have the poorest air circulation. Another
possibly very serious exposure may be to toxic
material brought home on hair, skin and clothing by
industrial workers who are members of the household.
Such exposures have been shown to lead to specific
illnesses that are distinctly related to occupational
exposures. Cases in point are mesothelioma or
berylliosis among family members of individuals
employed in occupations where they may bring home
asbestos or beryllium. These observations raise the
possibility that other diseases of members of a
household may not be recognized as being of
occupational origin. Finally we include basements
because, where there is a background, basements are
the major avenue for radon gas penetration.
Homemakers can accumulate high levels of exposure
because of the length of exposure.

The basis for our results is the statistical analysis
of the National Health Interview Survey, and not
of medically established cases. However, coupled with
the observation that homemakers may be exposed
substantially to carcinogens at home, very often in
unventilated spaces and subjected therefore to
repeated relatively large doses, the conclusion that
homemakers are at an increased risk from cancer
compared with women in other employment seems
plausible.  (A  more  definitive  answer  will  come  from 
a  case-control  study  underway  and  from  a  more 
detailed  analysis  of  household  use  of  toxic  material.) 

TABLE 1

Standardized Prevalence Ratios and 95% Confidence Intervals For a Number of
Chronic Conditions of Homemakers Standardized to Women Employed Outside the Home

                                    1970 - 1975              1982 - 1987
Condition                        SPR   (LC, UC)           SPR   (LC, UC)

Any Cause                        1.07  (1.01, 1.13)       1.17  (1.11, 1.23)
Any Chronic Condition            1.11  (1.05, 1.18)       1.21  (1.15, 1.28)
Hypertension                     1.31  (1.17, 1.47)       1.44  (1.27, 1.64)
Ischemic Heart Disease           1.63  (1.17, 2.29)       1.63  (1.18, 2.24)
Stroke                           3.29  (1.50, 7.21)       3.25  (1.84, 5.73)
Bronchitis/Emphysema/Asthma      1.10  (0.97, 1.25)       1.23  (1.07, 1.42)
Any Cancer                       1.58  (0.88, 2.86)       1.46  (0.76, 2.82)
Breast Cancer                    0.99  (0.33, 2.57)       0.95  (0.53, 1.73)
Lung Cancer                      N/A                      3.41  (0.43, 26.91)
Genital Urinary Cancer           2.24  (0.16, 31.53)      1.84  (0.59, 5.67)
Leukemia                         1.46  (0.04, 52.74)      1.92  (0.12, 31.3)
Cancer of Digestive Organs       3.16  (0.41, 25.59)      2.60  (0.70, 9.62)
Respiratory Cancer               N/A                      4.27  (0.57, 32.10)

TABLE 2

Standardized Prevalence Ratios and 95% Confidence Intervals For a Number of
Chronic Conditions of Homemakers Standardized to Women Employed Outside the Home
Adjusted for Differences in Standardized Prevalence Ratios For Any Chronic Condition

                                    1970 - 1975              1982 - 1987
Condition                        SPR   (LC, UC)           SPR   (LC, UC)

Any Chronic Condition            1.00                     1.00
Hypertension                     0.95  (0.84, 1.09)       1.19  (1.038, 1.37)
Ischemic Heart Disease           1.19  (0.85, 1.68)       1.34  (0.97, 1.86)
Stroke                           2.40  (1.09, 5.28)       2.68  (1.52, 4.75)
Bronchitis/Emphysema/Asthma      1.00  (0.86, 1.14)       1.02  (0.88, 1.19)
Any Cancer                       1.16  (0.65, 2.10)       1.04  (0.71, 1.50)
Breast Cancer                    0.89  (0.34, 2.32)       0.79  (0.43, 1.43)
Lung Cancer                      N/A                      2.82  (0.36, 22.27)
Genital Urinary Cancer           1.52  (0.18, 13.70)      1.52  (0.93, 4.69)
Leukemia                         1.07  (0.03, 38.50)      1.59  (0.10, 25.91)
Cancer of Digestive Organs       2.14  (0.27, 16.62)      2.15  (0.58, 7.96)
Respiratory Cancer               N/A                      3.53  (0.47, 26.56)

TABLE 3

Percent of Current, Former, Ever and Never Smokers Among White
Homemakers and Otherwise Employed Women for 1970 and 1987

                          Current   Former    Ever     Never
1970
  Homemakers               29.3      12.6     41.9     58.1
  Otherwise Employed       35.0      11.9     46.9     53.1
1987
  Homemakers               31.7      15.8     47.5     49.9
  Otherwise Employed       32.3      15.8     48.1     49.7

TABLE 4

Homemaker's Potential Exposure to Toxic Substances from Use of
Household Consumer Products, Appliances and Other Activities

Item Use                        Possible Components May Include:

Household cleaners:
  Window                        Ammonium hydroxide
  Spot/Textile                  Tetrachloroethylene, Trichloroethylene, Methyl
                                alcohol, Petroleum derived solvents, Methanol,
                                Benzene
  Soaps/Detergents              Polyether sulfates, Alcohols, Sulfonates, Alkyl
                                sodium isethionates
  Oven                          Sodium hydroxide, Potassium hydroxide
  Drain/Toilet bowl             Sodium hydroxide, Lye
  General cleaning              Ammonium hydroxide, Chlorine, Lye, Sodium
                                hypochlorite, Sodium peroxide

Home Repair/Maintenance:
  Paints/Varnish                Toluene, Xylene, Methylene Chloride, Heavy metal
                                pigments, Methanol, Ethylene glycol, Benzene
  Pesticides                    Organophosphates, Carbamates, Pyrethroids,
                                Botanicals (plant derivatives), Biological, Growth
                                regulators
  Lawn & Yard Care Items        Various pesticides, herbicides, gasoline, oil, paints,
                                fertilizers

Appliances:
  Humidifiers                   Off-gassing from water components, chlorinated
                                contaminants, Biological organisms
  Gas Range & Heater            NO2, CO, Formaldehyde
  Kerosene Heaters              NO2, CO, SO2, General petroleum hydrocarbons
  Electrical Equipment          Ozone

Spray Aerosol propellents       Propane, Butane, Nitrous Oxide, Methylene
                                Chloride, Isobutane, Fluorocarbon 11 and 12
Disinfectants                   Sodium hypochlorite, Quaternary ammonium,
                                Phenols, Pine oils
Furniture/Carpets Offgassing    Formaldehyde, General organics, Residues
Basement                        Radon daughters (depending on background and
                                ventilation)
Smoking                         Environmental Tobacco Smoke
Washing Clothing                Toxic material brought home on clothing by
                                employed member of household (such as
                                asbestos, beryllium etc.)
Hobbies                         Depends on the hobby
Beauty/Grooming Aids            Alcohol, sodium hydroxide, Thioglycollates, Talc,
                                Benzethonium

Abstract

Sampling techniques and exact solutions of Riemann
problems are used in a random choice method. This
procedure is used to obtain numerical solutions of a
system of conservation laws which describes the dynamics
of flow for small amplitude two-dimensional shockwaves.
An intrinsic coordinate system is used to formulate the
model.

1  Introduction 

Accuracy of numerical solutions and efficiency of
numerical schemes are major concerns in obtaining
numerical solutions. Moreover, the numerical solution
at the jump discontinuities called shocks should remain
sharp and stable and should transport the discontinuities
at the correct physical speed. Random variables have been used
to control numerical dissipation or numerical
viscosity. Basically, random variables appear either as a
component added to the deterministic equation to study
the effect of numerical viscosity, or they are used to
sample the solution at a randomly chosen point to obtain
a numerical solution which preserves some mathematical
properties of the solution function. The purpose of this
paper is to present a random choice method for
computing the numerical solution of two-dimensional
small amplitude shockwaves. The numerical random
sampling procedure is a shock-capturing, time-marching
method for solving systems of conservation laws.
The random sampling procedure consists of
approximating the numerical solution by a piecewise
constant state at each time step and proceeding to the
next time step by solving the corresponding problems
formed by the constant states on the neighboring spatial
intervals. It is well known that the exact solution of a
nonlinear system of partial differential equations arising in
fluid flow problems, even with smooth initial data,
develops shocks (jump discontinuities) in a finite time
interval. Thus it is not unnatural to approximate the
initial data with constant states.

The sampling procedure is based on approximating the
numerical solution of the given problem with a sequence
of elementary problems, known as Riemann
problems. These Riemann problems can be thought of
as a source of information about the solution within each
spatial mesh interval. More importantly, they provide
valuable information on wave interaction.

Godunov [1] initiated the use of solutions of
Riemann problems as building blocks for the
construction of numerical solutions of nonlinear
hyperbolic partial differential equations. Godunov
replaced the initial data by piecewise constant states
with jump discontinuities at the middle of each spatial mesh
interval. Then the exact solution of this Riemann
problem at the first time step is calculated. To proceed
to the next time step, this exact solution is replaced by a new
piecewise constant approximation, and the
corresponding Riemann problem is solved in a way that maintains
the integral properties of the conserved variables.

Another utilization of Riemann problems in obtaining
the solution of conservation laws was initiated by Glimm
[2], who followed Godunov as far as obtaining the exact
solution of the Riemann problem; the value of the
new approximate solution at the new time step is then taken
to be the exact solution evaluated at a random point on
that mesh interval. This solution is conservative only on the
average; however, it has the advantage that near a jump the
solution is incremented either by the amount of the jump or
not at all. This forces an initially sharp
discontinuity to remain sharp. Chorin [3] developed
Glimm's random choice method into a numerical
technique. By its construction, the random choice method
propagates shocks without introducing any
dissipation, and the method is unconditionally stable.
However, because the solution is approximated at a
randomly chosen point, a small amount of statistical noise
enters into the solution, which is acceptable within the
accuracy imposed by discretization of the model problem.
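The sampling idea described above can be illustrated on a much simpler problem than the one treated in this paper. The following is a minimal sketch, assuming the scalar inviscid Burgers equation $u_t + (u^2/2)_x = 0$ on a non-staggered grid with one random number per time step; the function names, grid sizes and the use of Python's built-in generator are illustrative assumptions and are not taken from the paper.

```python
import random

def riemann_burgers(u_left, u_right, speed):
    """Exact Riemann solution of u_t + (u^2/2)_x = 0 on the ray x/t = speed."""
    if u_left > u_right:                      # entropy-satisfying shock
        s = 0.5 * (u_left + u_right)          # Rankine-Hugoniot shock speed
        return u_left if speed < s else u_right
    if speed <= u_left:                       # left constant state
        return u_left
    if speed >= u_right:                      # right constant state
        return u_right
    return speed                              # inside the rarefaction fan

def random_choice_step(u, h, dt, xi):
    """Advance one step by sampling local Riemann solutions at x_i + xi*h."""
    n = len(u)
    new = [0.0] * n
    for i in range(n):
        if xi >= 0.0:   # sample point lies left of the interface i+1/2
            left, right = u[i], u[min(i + 1, n - 1)]
            offset = (xi - 0.5) * h
        else:           # sample point lies right of the interface i-1/2
            left, right = u[max(i - 1, 0)], u[i]
            offset = (xi + 0.5) * h
        new[i] = riemann_burgers(left, right, offset / dt)
    return new

def run(n_cells=100, n_steps=80, seed=1):
    rng = random.Random(seed)
    h = 1.0 / n_cells
    dt = 0.5 * h        # max|u| = 1, so waves travel at most h/2 per step
    u = [1.0 if i < n_cells // 4 else 0.0 for i in range(n_cells)]
    for _ in range(n_steps):
        u = random_choice_step(u, h, dt, rng.uniform(-0.5, 0.5))
    return u
```

In this sketch the computed profile only ever takes the two values of the initial step, so the shock is not smeared; its position is correct only on the average, which is the statistical noise mentioned above.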


2  Two-Dimensional Flow Problem
The  equations  describing  the  two-dimensional  flow  of 
shockwaves  with  a  source  term  in  fluid  dynamics  for 
compressible  fluid  may  be  written  in  the  form 

(1)  $u_t + f(u)_x + g(u)_y = h(u, x, y, t)$

where f and g are physical fluxes, h is a source term, and
the unknown quantity u is a function of x, y, t. Denoting
the front coordinate by $\alpha$ and letting the coordinate $\beta$ be
the arc length measured from a reference point along the
front, the successive front positions are given by the
family of curves $\alpha$ = constant and the ray positions by
the family of curves $\beta$ = constant. By using this intrinsic
coordinate system $(\alpha, \beta)$ (see Whitham [8]), where $\alpha$ and $\beta$
are functions of x, y, t, equation (1) can be written as

(2)  $w_\alpha + F(w)_\beta = G(w, \alpha, \beta)$

subject  to  the  initial  condition  given  by 

(3)  $w(0, \beta) = w_0(\beta)$.

Equations relating x and y to $\alpha$ and $\beta$ are given by

$x_\alpha = (1 + \tfrac{1}{2} m\phi)\cos\theta, \qquad x_\beta = -A\sin\theta$

$y_\alpha = (1 + \tfrac{1}{2} m\phi)\sin\theta, \qquad y_\beta = A\cos\theta$

Here $\theta$ is the angle that each front makes with the
positive x-axis, A is the cross-sectional ray-tube area, and
m is the acoustic Mach number. For small amplitude
two-dimensional shockwaves we have


f  ^  1 

'  -0 

A0 

Fiw)  = 

(jn<|>-0^)  /2 

.  0 

(4) 


0  ' 
0 

V  4CZJ 


G{w)  = 


where C is the local sound speed, $\phi$ is the nonlinearity
constant, which depends on the medium, and Z is the area
under the initial pulse. For a detailed discussion of these
equations see Zakeri [4]-[5]. To solve (2)-(3) we use the
operator splitting method to remove the inhomogeneous
term $G(w, \alpha, \beta)$. That is, first we solve the
corresponding one-dimensional homogeneous problem


(5)  $w_\alpha + F(w)_\beta = 0$

by the sampling procedure, and then we use its solution to
determine the value of the inhomogeneous term, $G(w, \alpha, \beta)$.
Finally, we solve the corresponding ordinary
differential equations (ODEs) given by

(6)  $w_\alpha = G(w, \alpha, \beta)$.

To solve (6) we use a common ODE solver such as
Runge-Kutta or a multi-level method.

3  Numerical Scheme

We develop a numerical scheme to compute the
successive shock fronts using geometrical shock dynamics
given by (2)-(3). One of the advantages of formulating
a model problem using geometrical shock dynamics is its
simplicity. To develop a random choice method, we
must first define a random variable over the closed
interval $[-\tfrac12, \tfrac12]$. It is absolutely necessary that the
successive values of the random variable tend to
equipartition the closed interval $[-\tfrac12, \tfrac12]$
(see Glimm [2]). To generate such a random variable, let
us consider a sequence of pseudorandom integers
generated by

(7)  (mod k)

where k is an odd positive integer and $N_0$ is an arbitrary
integer less than k. Let us define an equidistributed
sequence of random variables on the interval $[-\tfrac12, \tfrac12]$
given by


We introduce a front-ray grid defined by mesh
lengths $\Delta\alpha$ and $\Delta\beta$. The solution of (2)-(3) is to be
calculated both at grid points, i.e., at

$P(n, j) = (n\Delta\alpha, j\Delta\beta)$

and at the centers of the grid rectangles, i.e., at

$P(n+\tfrac12, j+\tfrac12) = ((n+\tfrac12)\Delta\alpha, (j+\tfrac12)\Delta\beta)$

where n and j are integers. We denote the approximate
value of w at the grid point by $w_j^n = w(n\Delta\alpha, j\Delta\beta)$.

Following the outline given above, let us consider the
local Riemann problem corresponding to (2) when the front
is at $n\Delta\alpha$, along with the piecewise constant initial data,
given by

(9)  $R_\alpha + F(R)_\beta = 0$

with $R(n\Delta\alpha, \beta)$ equal to one constant state for $\beta < j\Delta\beta$
and to another constant state for $\beta \ge j\Delta\beta$,

where


J^j-Vii-D* 

e  =  V4(l+sgn(o„*i)  -1) 

i.e., $\varepsilon$ = 0 or 1 whenever the random variable is negative or
non-negative, respectively. The Riemann problem here is
sampled at $(j+\tfrac12)\Delta\beta$ and at $(j-\tfrac12)\Delta\beta$.
When the random variable is non-negative, the initial data for the Riemann
problem is formed by using information at grid points
P(n, j) and P(n, j+1), and if it is negative, then the
initial data is constructed by using information at grid
points P(n, j) and P(n, j-1). At point $P(n+\tfrac12, j+\tfrac12)$ we
define

=  (n  +  y2)Aa) 

On  each  mesh  interval  we  get  a  local  Riemann  problem. 
In  order  to  assure  that  the  waves  produced  by  this 
sequence  of  local  Riemann  problems  do  not  interact  we 
must  have 

(10)  <  1. 

Aa 

This important requirement is known as the
Courant-Friedrichs-Lewy (CFL) condition. If inequality (10)
holds, then we can combine the solutions of the Riemann
problems (9) into a single exact solution.
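The operator splitting used to handle the inhomogeneous term (a homogeneous step followed by an ODE step) can be sketched on a toy problem. The example below is only an illustration under stated assumptions: it uses the scalar equation $u_t + a u_x = -\lambda u$ on a periodic grid, an exact one-cell shift in place of the sampling procedure, and the classical fourth-order Runge-Kutta formula as the ODE solver. All names and parameter values are hypothetical.

```python
import math

def rk4_step(g, w, dt):
    """One classical fourth-order Runge-Kutta step for dw/dt = g(w)."""
    k1 = g(w)
    k2 = g(w + 0.5 * dt * k1)
    k3 = g(w + 0.5 * dt * k2)
    k4 = g(w + dt * k3)
    return w + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def split_step(u, dt, decay):
    # Step 1: homogeneous problem u_t + a u_x = 0 with a*dt = h,
    # solved exactly by shifting the periodic grid one cell to the right.
    shifted = [u[-1]] + u[:-1]
    # Step 2: source term treated as the ODE du/dt = -decay*u (one RK4 step).
    return [rk4_step(lambda w: -decay * w, v, dt) for v in shifted]

def solve(n_cells=50, n_steps=100, decay=1.0, speed=1.0):
    """Return the maximum error of the split scheme against the exact solution."""
    h = 1.0 / n_cells
    dt = h / speed
    u = [math.sin(2.0 * math.pi * i * h) for i in range(n_cells)]
    for _ in range(n_steps):
        u = split_step(u, dt, decay)
    t = n_steps * dt
    exact = [math.sin(2.0 * math.pi * (i * h - speed * t)) * math.exp(-decay * t)
             for i in range(n_cells)]
    return max(abs(a - b) for a, b in zip(u, exact))
```

For this linear toy problem the two sub-steps commute, so the only error left is that of the ODE solver; for the nonlinear system (2) a splitting error would be present as well.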

4  Solution  of  Riemann  Problem 
The main part of a random choice algorithm is obtaining
the solutions of a sequence of local Riemann problems
efficiently. The solution of a Riemann problem consists
of three elementary waves: a backward shock wave or
rarefaction on the left, a slip line, and a forward shock or
rarefaction on the right. A slip line is a discontinuous
solution separating two constant states such that the
angle of the flow remains the same on both sides of the
discontinuity line while the Mach number is arbitrary. Slip
lines are a one-family solution between the backward and
forward waves, i.e., between rarefactions and shocks. To
solve the Riemann problem (9) we follow Lax [6]. Let us
consider the following initial data for the system of equations
in (9)


(11)  $R(n\Delta\alpha, \beta) = \begin{cases} w_1 & ;\ \beta < j\Delta\beta \\ w_2 & ;\ \beta \ge j\Delta\beta \end{cases}$


where subscripts 1 and 2 refer to values of w just behind
and just ahead of the discontinuity, respectively. If
these two values are equal, then the solution of (9) is a
constant state and its value is equal to the value of the initial
data. However, if these two values are different, then the
initial jump discontinuity will propagate in the form of
a centered expansion wave and/or a shock (i.e., a jump
discontinuity that satisfies the entropy condition) or a contact
discontinuity. In order that the solution converge to a
unique weak solution of (9), it must satisfy the
Rankine-Hugoniot jump condition and the Oleinik
entropy condition. At the shock, let us define the values
of $R(\alpha, \beta)$ just behind and just ahead of the shock by


Ri  =  limR(a,P)  Rj  =  limR(o,P) 

The jump conditions for the system of equations (4) are
given by

da  Aj 

n2i  ~  <02-e!) 

^  ^  da  ~  2  (^2  02  -AiBi) 

A2in|  . 

The entropy condition is given by

$\dfrac{F(R_2) - F(R)}{R_2 - R} \le \dfrac{F(R_2) - F(R_1)}{R_2 - R_1}$

for any R between $R_2$ and $R_1$. Entropy satisfaction
is a major concern for the numerical approximation of
solutions of nonlinear fluid flow problems. It simply
means that the computed solution converges toward the
correct physical solution as the mesh sizes of the intervals
along $\alpha$ and $\beta$ approach zero. The above inequality
can be written as

$E(R) = F(R_2) + \dfrac{F(R_2) - F(R_1)}{R_2 - R_1}\,(R - R_2)$

satisfying the following inequality

$(E(R) - F(R))\,(R_1 - R_2) \ge 0$

where  E(R)  defines  the  chord  connecting  left  and  right 
limiting  points  across  the  shock.  The  entropy  related  to 
the  first  component  of  F  is  given  by 

e,-e  ^  e,-e, 

A2  A2 

for any A between $A_1$ and $A_2$, and $\theta$ between $\theta_1$ and $\theta_2$.
Similar inequalities hold for the other components of F.

4.1  Rarefaction Waves

Rarefaction waves are two families of solution curves,
forward and backward waves. In this section we compute
the simple rarefaction waves of the system of equations (9),
which can be reformulated in the form

$U_\alpha + H(U)\,U_\beta = 0$

where $H(U) = H(A, \theta, m)$ is the 3 by 3 matrix given by

$U = \begin{pmatrix} A \\ \theta \\ m \end{pmatrix}, \qquad H = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & \dfrac{\phi}{2A} \\ 0 & \dfrac{m}{2A} & 0 \end{pmatrix}$

The eigenvalues $\mu$ and their corresponding eigenvectors
e of the matrix H are

$\mu = 0,\ \pm\dfrac{\sqrt{m\phi}}{2A}, \qquad e = \begin{pmatrix} 1 \\ -\mu \\ -2\mu^2 A/\phi \end{pmatrix}$

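For the system of equations in (9) the simple rarefaction
waves are the continuous solutions of (9) of the form

(13)  $U(\alpha, \beta) = \begin{cases} R_1 & \text{if } \beta/\alpha < \mu(R_1) \\ v(\beta/\alpha) & \text{if } \beta/\alpha = \mu(v) \\ R_2 & \text{if } \beta/\alpha > \mu(R_2) \end{cases}$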

where v is an integral curve of the vector field of the
corresponding eigenvector connecting the two constant
states such that the corresponding eigenvalue $\mu$ is
increasing between these two constant states from left to
right. Since the matrix H has three distinct eigenvalues,
there are three possible rarefaction waves through any
given state. These rarefaction waves are the integral
curves of the vector field defined by each eigenvector of
matrix H, i.e., each eigenvector is tangent to the integral
curve at each point. Thus for the eigenvector e
corresponding to the eigenvalue $\mu$ of matrix H, the
integral curves are solutions of the following system of
equations

$\dfrac{dA}{1} = \dfrac{d\theta}{-\mu} = \dfrac{dm}{-2\mu^2 A/\phi}$

If $\mu$ = 0 then the integral curves of its corresponding
eigenvector, e = (1, 0, 0), are curves where $\theta$ and m are
both constant. Hence a simple rarefaction wave of the
form (13) exists if the left and right values of $\theta$ and m
across the shock are equal; in addition, $\mu$ must be an
increasing function of $\theta$ and m from left to right across
the shock. Therefore there is no A-rarefaction wave.

$\theta$-rarefaction waves. If $\mu$ is not zero then the integral
curves of its corresponding eigenvector are curves where
$m^2 A$ is constant. There are two families of curves where
$\theta$ is either positive or negative along each integral curve.
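As a consistency check on the eigenstructure used above, the eigenvector relation $He = \mu e$ can be verified numerically. The sketch below assumes that H has rows (0, -1, 0), (0, 0, $\phi/2A$) and (0, $m/2A$, 0), which is a reading of the matrix chosen so that it agrees with the eigenvector $(1, -\mu, -2\mu^2 A/\phi)$ and with $m^2 A$ being constant along the integral curves; the numerical values of A, m and $\phi$ in the usage line are arbitrary.

```python
import math

def h_matrix(A, m, phi):
    """Assumed form of H(A, theta, m); the (3,2) entry is an inferred reading."""
    return [[0.0, -1.0, 0.0],
            [0.0, 0.0, phi / (2.0 * A)],
            [0.0, m / (2.0 * A), 0.0]]

def eigenpairs(A, m, phi):
    """Eigenvalues mu = 0, +/- sqrt(m*phi)/(2A) with e = (1, -mu, -2 mu^2 A/phi)."""
    root = math.sqrt(m * phi) / (2.0 * A)
    return [(mu, [1.0, -mu, -2.0 * mu * mu * A / phi]) for mu in (-root, 0.0, root)]

def residual(A, m, phi):
    """Largest entry of |H e - mu e| over the three eigenpairs."""
    H = h_matrix(A, m, phi)
    worst = 0.0
    for mu, e in eigenpairs(A, m, phi):
        He = [sum(H[i][j] * e[j] for j in range(3)) for i in range(3)]
        worst = max(worst, max(abs(He[i] - mu * e[i]) for i in range(3)))
    return worst

print(residual(1.0, 0.01, 1.2))
```

Along the eigenvector for $\mu \ne 0$ one has $dm/dA = -2\mu^2 A/\phi = -m/(2A)$ under this assumption, which integrates to $m^2 A$ = constant.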


5  Numerical Experiments

We compared the numerical solutions using the random
choice method developed here with those solutions
obtained using the method of characteristics. Consider
the initial condition

A(0, β) = 1
m(0, β) = 0.01

t-5-P  -1<P<1

12^  P

0  Otherwise

together with equations (2) and (4). The results obtained
with both methods were very close to each other. The
numerical calculations show that the convergence of the
solution toward the exact solution is independent of the
choice of the odd number k in (7) as long as k is
bounded. In addition, the method is numerically stable.
All the striking physical features of the system of equations
(2) and (4) are observed, i.e., as the initial front
propagates into the rest state the center of the hump
becomes flat and this flat region propagates in both
directions until the front becomes a flat surface.

6  Conclusion

The numerical solutions show that the method is stable
and correctly describes the important physical features of
the solution of the model problem. The various choices
of random number generators do not have any effect on
the accuracy of the computed solution as long as the random
variable tends toward the equipartitioning of the given
interval.
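The equipartitioning requirement on the sampling sequence can be made concrete with a small example. The generator of equation (7) is not reproduced here; instead, as a clearly different and swapped-in technique, the sketch uses the base-2 van der Corput sequence, a standard low-discrepancy sequence, shifted to the interval $[-\tfrac12, \tfrac12)$. The function names are hypothetical.

```python
def van_der_corput(n, base=2):
    """n-th element (n >= 1) of the van der Corput sequence in [0, 1)."""
    value, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        value += digit / denom
    return value

def sampling_sequence(count):
    """First `count` sampling values, shifted to the interval [-1/2, 1/2)."""
    return [van_der_corput(i) - 0.5 for i in range(1, count + 1)]

def bin_counts(samples, bins):
    """Number of samples falling in each of `bins` equal subintervals."""
    counts = [0] * bins
    for x in samples:
        counts[min(int((x + 0.5) * bins), bins - 1)] += 1
    return counts

print(bin_counts(sampling_sequence(64), 8))
```

With 64 samples and 8 equal subintervals every subinterval receives exactly 8 samples, which is the kind of equipartitioning behaviour the conclusion refers to.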
ABSTRACT. Linear compartmental systems have
compartments with flows to and from the compartments.
Of interest is the estimation of the constants, θ,
governing the flows. In the particular system considered,
only one compartment, out of several, is observed for n
cases over k time points. A stochastic model is used with
a maximum likelihood approach taken to the estimation of
θ. The algorithm involves iteratively using an estimate of
θ to solve differential equations which describe the
system, and improving on the estimate of θ by adding a
constant multiple, α, of an increment δθ. Allen's results
are incorporated to obtain required derivatives. Due to
non-zero correlations, a modification to Jennrich and
Moore's results is made, involving using both the
observations and their cross-products, to obtain δθ. α is
determined with Fletcher's method. A program in Turbo
Pascal implements the algorithm.

INTRODUCTION.  Compartmental  systems  have  long 
been  a  useful  tool  in  pharmacokinetics  (Wagner  (1971)). 
The  body  is  thought  of  as  a  series  of  compartments,  with 
a  drug  moving  between  any  of  the  compartments.  For 
example, Gladtke et al. (1979), p. 36, suppose that the
body may be represented as three compartments: plasma,
muscle and extravascular. An initial muscle injection is
given and the levels of the drug in the plasma are
monitored from time to time. The drug will flow from
muscle to plasma. Additionally, flow will occur between the
extravascular system and plasma, and from plasma to the
outside, as depicted in diagram 1.

Diagram 1: Intramuscular -> Plasma; Plasma <-> Extravascular; Plasma -> Outside


For simplicity, this paper will be restricted to a three
compartmental system of this type, with compartments
1, 2, 3 and the outside as compartment 0; flows have a
parameter attached to them as in diagram 2. (In the
traditional deterministic model, these are the rate
parameters.)


Diagram 2: Compartment 1 -> Compartment 2 (θ1); Compartment 2 -> Compartment 3 (θ2); Compartment 3 -> Compartment 2 (θ3); Compartment 2 -> Compartment 0 (θ4)

It is assumed that a bolus injection is given in compartment
1 and only compartment 2 is observed, at times $t_1, t_2, \ldots, t_k$.
Let $\theta_0$ be the initial concentration in compartment 1 and
C(t) be the concentration at time t. Then

$C(t) = (C_1(t),\ C_2(t),\ C_3(t))^T, \qquad C(0) = (\theta_0,\ 0,\ 0)^T.$

STOCHASTIC MODEL. For the "particle model", as
discussed by Purdue (1974a), it is assumed that there are
N particles in the system acting independently. Transitions
between compartments follow a Markov process, with the
transition probability being constant. The resulting system
of equations is as follows:

$dP(t)/dt = P(t)A$,    (1)

where $P(t) = (p_{ij}(t))$ is a nonsingular 3x3 matrix with $p_{ij}(t)$,
i,j = 1,2,3, the probability of a particle transferring from
compartment "i" to compartment "j" in time t and

$A = \begin{pmatrix} -\theta_1 & \theta_1 & 0 \\ 0 & -(\theta_2+\theta_4) & \theta_2 \\ 0 & \theta_3 & -\theta_3 \end{pmatrix}.$

$\theta = (\theta_0\ \theta_1\ \theta_2\ \theta_3\ \theta_4)^T$ is to be estimated.
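For constant A, equation (1) with P(0) = I has the solution $P(t) = \exp(tA)$, and this can be illustrated numerically. The sketch below is not the method of Allen (1987) used by the program described here; it is a plain scaling-and-squaring Taylor series, and it assumes a rate matrix whose first row is $(-\theta_1, \theta_1, 0)$, which is a reading consistent with diagram 2. Parameter values in the usage line are arbitrary.

```python
import math

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, terms=20, squarings=10):
    """exp(M) by scaling and squaring with a truncated Taylor series."""
    n = len(M)
    scale = 2.0 ** squarings
    S = [[M[i][j] / scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = mat_mul(term, S)
        term = [[v / k for v in row] for row in term]     # S^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):                            # undo the scaling
        result = mat_mul(result, result)
    return result

def rate_matrix(th1, th2, th3, th4):
    """Assumed rate matrix A for the three-compartment system."""
    return [[-th1, th1, 0.0],
            [0.0, -(th2 + th4), th2],
            [0.0, th3, -th3]]

def transition_matrix(t, th1, th2, th3, th4):
    """P(t) = exp(tA), the solution of dP/dt = P A with P(0) = I."""
    A = rate_matrix(th1, th2, th3, th4)
    return mat_exp([[a * t for a in row] for row in A])

P = transition_matrix(2.0, 0.5, 0.3, 0.2, 0.1)
print(P[0][0], math.exp(-0.5 * 2.0))
```

Under the assumed matrix, a particle leaves compartment 1 only through the flow with rate $\theta_1$, so $p_{11}(t) = e^{-\theta_1 t}$, and each row of P(t) sums to at most one because of the loss to compartment 0.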
Under these assumptions it can be shown that the exact
distribution of C(t), given C(0), is a multinomial (since only
$C_1(0)$ is nonzero), but the distribution of $C(t_1)$, given $C(t_2)$,
is a convolution of multinomials. Thus, to write down the
exact distribution of the concentrations in compartment 2 for
$t_1, t_2, \ldots, t_k$, when k is large, is impossible since it would
consist of many sums whose limits are complex. Using a
diffusion approximation, it has been demonstrated that the
concentrations over time have a multivariate normal
distribution, Lehoczky and Gaver (1977). Hence, the
concentrations in compartment 2 have a
(marginal) multivariate normal distribution, with mean and
variance-covariance matrix the same as that given by the
particle model. Simpson (1988) has shown that an estimator
derived from the maximum likelihood equations, using this
approximate normal distribution, is still a consistent
asymptotically normal estimator, if the particle model is the
correct one. Thus, an algorithm was written to estimate θ
using the approximate normal distribution.

Suppose that $X_{ij}$ is the concentration in compartment 2 for
person i, i = 1,2,...,n at time j = 1,2,...,k. $X_i = (X_{i1}, X_{i2}, \ldots, X_{ik})^T$
are the k observations of the ith case. $X_1, X_2, \ldots, X_n$
are independently and identically distributed as a normal
variable with mean and variance which are functions of the
probability matrix P. By considering a vector comprised of
the sufficient statistics of the covariance of X, it can be seen
that a normally distributed random variable is a linear
exponential and so the algorithm of Jennrich and Moore
(1975), henceforth referred to as JMA, can be used to find the
approximate maximum likelihood estimator. We take the
$k_1 \times 1$ vector Y, where $k_1 = k(k+3)/2$ and

$Y = \left( \sum_{r=1}^{n} X_{r1},\ \ldots,\ \sum_{r=1}^{n} X_{rk},\ \sum_{r=1}^{n} X_{r1}^2,\ \sum_{r=1}^{n} X_{r1}X_{r2},\ \ldots,\ \sum_{r=1}^{n} X_{r,k-1}^2,\ \sum_{r=1}^{n} X_{r,k-1}X_{rk},\ \sum_{r=1}^{n} X_{rk}^2 \right)^T.$

THE ALGORITHM. Suppose that the $k_1 \times 5$ matrix of
partial derivatives of μ with respect to $\theta_i$, i = 0,1,...,4, is denoted
by ∂μ/∂θ. According to the JMA, for a given θ, replace θ
by θ + αδθ, where


This  requires  that 

(1)  P(.) and its derivatives with respect to θ be obtained.

(2)  δθ be calculated.

(3)  An appropriate α be found.


(1) P(.) and its derivatives:

For  this  compartmental  system,  P(.)  and  its  derivatives  can 
be  calculated  analytically.  However,  to  maintain  generality 
for  the  program,  it  was  decided  to  obtain  P(.)  and  its 
derivatives  by  applying  the  method  developed  by  Allen 
(1987)  to  columns  of  P(.). 

The  steps,  for  each  column  of  P,  are  as  follows: 

(i)  Find R so that

-  $R^T A R = T$, a triangular matrix,

-  $R^T R = I$, the identity matrix

(ii)  Using R, obtain a triangular system of equations;
so, a backward solving technique can be used, with
the initial condition P(0) = I satisfied.


For any s = 1,2,3,4, the equation (1) becomes

(t)  =IV.(t)  R*X{t)  , 

CLt 


RV^{t) 


and Allen's method can be used again to solve for $V_s(t)$,
with $V_s(0) = 0$ as the initial condition.


Using properties of normality, Y has a mean, μ(θ), and
variance-covariance matrix, Σ(θ), which are functions of the
mean and variance-covariance matrix of X, and therefore of
P, also.


(2) Calculation of δθ:

Using Wilkinson's algorithms, Wilkinson (1971), a
Cholesky decomposition is used to find $\Sigma^{1/2}$ where
$\Sigma^{1/2}(\Sigma^{1/2})^T = \Sigma$. Using this, the following equation is solved to
obtain δθ:
(3) An α:

Fletcher's descent method, Fletcher (p. 26, 1980), is used to
find α. It uses the first derivatives only. Assuming uniform
continuity conditions, it will achieve, at least, a local
optimum if an optimum exists and if the starting value is
close enough.
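The role of the Cholesky factor in computing the increment can be sketched as follows. The example assumes that the increment solves the weighted least squares problem $\delta\theta = (D^T \Sigma^{-1} D)^{-1} D^T \Sigma^{-1} (Y - \mu)$ with $D = \partial\mu/\partial\theta$, which is the usual Gauss-Newton form for linear exponential families; this form, the function names and the small test values are assumptions made for illustration and are not a transcription of the program described here.

```python
import math

def cholesky(S):
    """Lower-triangular L with L L^T = S for a symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def back_sub_transposed(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def increment(D, Sigma, resid):
    """Weighted least squares increment using Cholesky factors only."""
    L = cholesky(Sigma)
    z = forward_sub(L, resid)                         # L^{-1} (Y - mu)
    p = len(D[0])
    W = [forward_sub(L, [row[c] for row in D]) for c in range(p)]  # columns of L^{-1} D
    N = [[sum(a * b for a, b in zip(W[r], W[c])) for c in range(p)] for r in range(p)]
    rhs = [sum(a * b for a, b in zip(W[r], z)) for r in range(p)]
    C = cholesky(N)                                   # normal equations are SPD
    return back_sub_transposed(C, forward_sub(C, rhs))

print(increment([[1.0], [1.0]], [[4.0, 2.0], [2.0, 3.0]], [1.0, 2.0]))
```

The step length α would then scale this increment before it is added to the current estimate; the line search itself is not shown.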

SUMMARY. A stochastic approach rather than the
traditional deterministic model approach is taken to a
particular compartmental model, where only one
compartment is observed. An algorithm is developed which
uses the JMA to obtain maximum likelihood estimates from
a diffusion approximation. The program is written in Turbo
Pascal and can be generalised:

(i)  to other linear compartmental systems,

(ii)  for θ to be functions of time.

Work  is  being  done: 

(iii)  to  incorporate  measurement  error  in  the  model 

(iv)  to  include  people  variation 

An  important  step  for  its  general  use  would  be  to 
incorporate  this  program  in  a  general  pharmacokinetic 
program  so  that,  in  a  user  friendly  environment,  its 
estimates  could  be  easily  compared  to  those  obtained  from 
other  methods. 
Abstract

Nonparametric Bayesian estimators with Dirichlet process
priors for doubly censored data can be derived from
mixtures of Dirichlet distributions. To circumvent the
computational difficulties in evaluating these mixtures,
this paper describes the Gibbs sampling approach to
approximating them. The Gibbs samplers augment the
censored data by the number of observations falling into
each interval. An example taken from Turnbull (1974)
is given to illustrate the approach.

Keywords: Gibbs sampling; Stochastic substitution;
Dirichlet process priors; Doubly censored data.

1  Introduction 

Nonparametric Bayesian inference for the survival function
with right censored data has been studied by Susarla
and Van Ryzin (1976), and Ferguson and Phadia
(1979). However, we often encounter the situation
where some observations are censored from the left and
some observations are censored from the right. Turnbull
(1974) has cited many papers addressing doubly
censored data sets and their frequentist analyses.

This paper studies a nonparametric Bayesian approach
to the data analysis. This approach allows us
to incorporate prior belief and frees us from making a
restrictive model assumption for the survival function.
Specifically, we assume the survival function is taken
from Ferguson's (1973) Dirichlet process, $\mathcal{D}(\alpha)$. The
prior parameter, $\alpha$, may be written $\alpha = M F_0$, where
$F_0$ represents the statistician's prior guess of the
distribution function of the times of incident (death) and
M represents the degree of concentration of the true
distribution function around $F_0$.

Due to the doubly censored data, it is usually very
difficult to obtain an explicit expression for the
nonparametric Bayes estimators. Fortunately, it is not
necessary to have this closed form to obtain numerical
solutions to the problem of computing Bayes estimators.
This paper proposes a Gibbs sampling approach
to computing them. The approach augments the data
by using latent variables that decompose the number of
the censored observations into the possible number of
observations falling into each interval. This augmentation
helps us in specifying the conditional densities
of the survival functions given the latent variables. A
repeated sampling scheme, which uses this conditional
density and the conditional density of the latent variables
given the distribution function and the data,
allows us to approximate the posterior distribution of the
survival function.

Although we emphasize doubly censored data
in this paper, the model discussed in the next section
is very general. It applies to data sets that include
only completely observed data and right censored data.
Nonparametric Bayesian estimators in this situation
have been derived by Susarla and Van Ryzin (1976),
and Ferguson and Phadia (1979). Kuo (1991) computed
these estimates based on the data from Kaplan
and Meier (1958) using the Gibbs sampling approach.
These estimates compare favorably to the estimates
obtained by Susarla and Van Ryzin.

The model also includes the situation in which none
of the completely observed data are available, i.e. all
incidents are either right or left censored. The likelihood
reduces to that considered in quantal bioassay.
Gelfand and Kuo (1991) study the sampling based
approach to this problem. In addition to the Dirichlet
process prior, they also consider a product of beta
priors. They also generalize their results to polytomous
response.

Section  2  discusses  the  model.  Section  3  describes 
the  Gibbs  sampling  approach.  An  example  using  the 
data  set  in  Turnbull  (1974)  is  given  in  Section  4. 

2  The  Model 

The model is basically the one studied by Turnbull
(1974). Turnbull proposes a self-consistent algorithm
for computing the generalized maximum likelihood
estimators. This paper adds the Dirichlet process prior
to the model.

Let $T_1, T_2, \ldots, T_n$ denote the true survival times
of n individuals that could be observed precisely if
no censoring were present. The $T_i$ are independent
and identically distributed with distribution F; that
is, $F(t) = \mathrm{Prob}(T \le t)$ for $t \ge 0$. We consider the case
that not all $T_i$ are observed precisely. For each i, we
assume that there are "windows" of observation $L_i$ and
$U_i$ ($L_i < U_i$) that are either fixed constants or random
variables independent of the $\{T_i\}$. We observe

$X_i = \max(\min(T_i, U_i), L_i).$

Moreover, for each item, we also know whether it is left
censored (late entry) with $X_i = L_i$, or right censored (a
loss) with $X_i = U_i$, or a precisely observed time (death)
with $X_i = T_i$.
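The observation rule above can be written out directly. The following is a small illustrative sketch; the function name and the status labels are hypothetical.

```python
def observe(t, lower, upper):
    """Return (x, status) for a true time t observed through the window [lower, upper]."""
    x = max(min(t, upper), lower)        # X = max(min(T, U), L)
    if t < lower:
        status = "left-censored"         # late entry: X = L
    elif t > upper:
        status = "right-censored"        # loss: X = U
    else:
        status = "exact"                 # death observed: X = T
    return x, status

print(observe(3.0, 1.0, 5.0))
print(observe(0.5, 1.0, 5.0))
print(observe(7.0, 1.0, 5.0))
```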

Usually, items are examined at discrete times, for
example, monthly. We can assume there is a natural
discrete time scale $0 < t_1 < t_2 < \cdots < t_m$, and the
observed deaths are classified into one of the m intervals
$(0, t_1], (t_1, t_2], \ldots, (t_{m-1}, t_m]$. Let $\delta_i$ denote the number
of observed deaths in the period $(t_{i-1}, t_i]$, $\mu_i$ denote the
number of late entries at age $t_i$, and $\lambda_i$ denote the number
of losses at $t_i$. It is assumed that the late entries $\mu_i$
all occur at the end of age period $(t_{i-1}, t_i]$ and the losses
$\lambda_i$ all occur at the beginning of $(t_i, t_{i+1}]$. The data can
be summarized by the following tabulation:

Type \ age          $(0, t_1]$    $(t_1, t_2]$    ...    $(t_{m-1}, t_m]$

Deaths              $\delta_1$    $\delta_2$      ...    $\delta_m$

Late entries (<)    $\mu_1$       $\mu_2$         ...    $\mu_m$

Losses (>)          $\lambda_1$   $\lambda_2$     ...    $\lambda_m$

Let $P_j = P(t_j) = 1 - F(t_j)$ denote the survival
function evaluated at $t_j$. The likelihood function is
proportional to

$\prod_{j=1}^{m} (P_{j-1} - P_j)^{\delta_j} (1 - P_j)^{\mu_j} P_j^{\lambda_j}.$

Let $q_j = P_{j-1} - P_j$ for $j = 1, \ldots, m$ and let $q_{m+1} = P_m$.
Ferguson's process prior assumes that the distribution
of the q's is the Dirichlet distribution

$\pi(q) = C \prod_{j=1}^{m+1} q_j^{\alpha_j - 1},$

where

$\alpha_j = M\,(F_0(t_j) - F_0(t_{j-1})),$

for $j = 1, \ldots, m+1$, with $F_0(t_{m+1}) = 1$, and

$C = \dfrac{\Gamma(M)}{\prod_{j=1}^{m+1} \Gamma(\alpha_j)}.$

The posterior distribution of $q = (q_1, q_2, \ldots, q_m, q_{m+1})$
is a mixture of Dirichlet distributions. The results
of Antoniak (1974) can be used to derive this mixture.
The next section will develop the Gibbs sampling
technique for approximating this mixture.
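The prior parameters $\alpha_j = M(F_0(t_j) - F_0(t_{j-1}))$ are simple to compute once a prior guess $F_0$ is fixed. The sketch below assumes an exponential prior guess, inspection times 1, 2, 3, 4 and M = 5, all of which are hypothetical choices made for illustration, and takes $F_0(t_0) = 0$ and $F_0(t_{m+1}) = 1$.

```python
import math

def prior_parameters(times, M, F0):
    """alpha_j = M * (F0(t_j) - F0(t_{j-1})) for j = 1..m+1, with F0(t_0) = 0 and F0(t_{m+1}) = 1."""
    cdf = [0.0] + [F0(t) for t in times] + [1.0]
    return [M * (cdf[j] - cdf[j - 1]) for j in range(1, len(cdf))]

# Hypothetical prior guess: exponential distribution with rate 0.5.
alpha = prior_parameters([1.0, 2.0, 3.0, 4.0], M=5.0,
                         F0=lambda t: 1.0 - math.exp(-0.5 * t))
print(alpha, sum(alpha))
```

The parameters sum to M, so a larger M concentrates the prior more tightly around $F_0$.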

3  Gibbs  Sampling 

To employ the Gibbs sampling technique, we need
to introduce the latent variables that decompose the
numbers of losses and late entries into the numbers
of observations belonging to individual intervals. Let
$Z_{1j}, Z_{2j}, \ldots, Z_{jj}$ denote the random variables that count
the number of observations in $\mu_j$ that might fall in the
intervals $(0, t_1], (t_1, t_2], \ldots, (t_{j-1}, t_j]$, respectively.
Observe $\mu_j = \sum_{i=1}^{j} Z_{ij}$. Moreover, let $Z_{j+1,j}, \ldots, Z_{m+1,j}$
denote the number of observations in $\lambda_j$ that might fall
in the intervals $(t_j, t_{j+1}], \ldots, (t_{m-1}, t_m], (t_m, \infty)$,
respectively. Observe $\lambda_j = \sum_{i=j+1}^{m+1} Z_{ij}$.

Our objective is to obtain the posterior distribution
of the q given the data. To apply the stochastic
augmentation idea discussed in Tanner and Wong (1987)
and in Gelfand and Smith (1990), we can sample from
two densities recursively. The first density is the
posterior density of the q given the Zs and the data, which
is an updated Dirichlet distribution depending only on
the Zs. The second one is the posterior density of the
Zs given the q and the data, which is the density of a
product of multinomial distributions.

Suppose at the ith iteration step of the
Gibbs sampling, we have the probabilities $q^i = (q_1^i, q_2^i, \ldots, q_{m+1}^i)$,
with $\sum_{l=1}^{m+1} q_l^i = 1$, where $q_l^i$ represents
an estimate of $q_l$. Then we can update the Z variables
from the multinomial distributions. That is, for each j,
$j = 1, \ldots, m$, we sample $Z_{1j}^i, \ldots, Z_{jj}^i$ from the
multinomial distribution with sample size $\mu_j$ and
parameters $r_{1j}^i, \ldots, r_{jj}^i$, where $r_{lj}^i = q_l^i / \sum_{h=1}^{j} q_h^i$ for $l = 1, \ldots, j$.
Similarly, we sample $Z_{j+1,j}^i, \ldots, Z_{m+1,j}^i$ from the
multinomial distribution with sample size $\lambda_j$ and
parameters $r_{j+1,j}^i, \ldots, r_{m+1,j}^i$, where $r_{lj}^i = q_l^i / \sum_{h=j+1}^{m+1} q_h^i$ for
$l = j+1, \ldots, m+1$.

Having sampled the Z random variables, we can
update the q variables by the Dirichlet distribution. Let
us compute, for each l, $l = 1, \ldots, m+1$,

$\gamma_l^i = \alpha_l + \delta_l + \sum_{j=1}^{m} Z_{lj}^i$, with $\delta_{m+1} = 0$.

Then we could sample $q^{i+1}$ from
the Dirichlet distribution with parameters $(\gamma_1^i, \ldots, \gamma_{m+1}^i)$.
Now we use the updated q's to continue
sampling until the Ith step.

By starting with independent initial choices of the $q^0$,
we can also replicate the iterations ν times. After
ν replications, each to the Ith iteration, we have
$q_{1,s}^I, \ldots, q_{m+1,s}^I$ and the corresponding Dirichlet parameters
$\gamma_{1,s}^I, \ldots, \gamma_{m+1,s}^I$ for $s = 1, \ldots, \nu$.
The posterior distribution of $q_l$ for $l = 1, \ldots, m+1$ can
be approximated by

$\hat{\pi}(q_l \mid \text{data}) = \nu^{-1} \sum_{s=1}^{\nu} \mathrm{Beta}\left(\gamma_{l,s}^I,\ \sum_{h \ne l} \gamma_{h,s}^I\right),$

where Beta(a, b) denotes the beta density with parameters
a and b. Then the posterior estimate of the $q_l$ can
be given by

$\hat{q}_l = \nu^{-1} \sum_{s=1}^{\nu} \dfrac{\gamma_{l,s}^I}{\sum_{h=1}^{m+1} \gamma_{h,s}^I}.$

The posterior standard error (S.E.) of $\hat{q}_l$ and the
naive posterior confidence interval for the $q_l$ can be
computed similarly from the replicated samples.

The numbers $I$ and $\nu$ are selected to achieve convergence to smooth estimates. We can fix a number $\nu$ and plot the posterior densities of $q_l$ given the other q's (beta distributions) for two different iteration numbers, for example, 5 units apart. We increase the iteration numbers until the two densities come close to each other. Then we increase the number of replications for the final run. The choice of $I$ determines the convergence of the density estimates to the actual marginal posterior density at an exponential rate (Geman and Geman, 1984; Tanner and Wong, 1987). The order of convergence for the replications is $O(\nu^{-1/2})$. The standard error of the mean and the confidence intervals from the replications could also help us in selecting the desired $\nu$.


4  Numerical  Examples 

The data set taken from Turnbull (1974) is summarized in the following:

Type of obs. \ age    (0, t1]   (t1, t2]   (t2, t3]   (t3, t4]
Deaths                   12         6          2          3
Late entries              2         4          2          5
Losses                    3         2          0          3

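The likelihood function is

$$L(q) = q_1^{12}\, q_1^{2}\, (q_2 + q_3 + q_4 + q_5)^{3} \times q_2^{6}\, (q_1 + q_2)^{4}\, (q_3 + q_4 + q_5)^{2} \times q_3^{2}\, (q_1 + q_2 + q_3)^{2} \times q_4^{3}\, (q_1 + q_2 + q_3 + q_4)^{5}\, q_5^{3}.$$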
Table 1: Gibbs Approximation to the Bayes Estimates for $\alpha(l) = .00001$, $l = 1, \ldots, 5$

Statistics \ Cell      1      2      3      4      5
$\hat{q}_l$          .462   .243   .084   .116   .095
S.E.                 .001   .002   .001   .001   .001
$\hat{P}_l$          .538   .295   .211   .095   0

Table 2: Gibbs Approximation to the Bayes Estimates for $\alpha(l) = 1$, $l = 1, \ldots, 5$

Statistics \ Cell      1      2      3      4      5
$\hat{q}_l$          .431   .237   .100   .126   .106
S.E.                 .004   .005   .003   .003   .002
$\hat{P}_l$          .569   .332   .232   .106   0

We generate all the $Z_{lj}$ variables as described in Section 3. For example, let $Z_{11} = 2$ and $Z_{21} + Z_{31} + Z_{41} + Z_{51} = 3$. We generate the $Z_{21}, \ldots, Z_{51}$ variables from the multinomial $MN(3; r_{21}, r_{31}, r_{41}, r_{51})$ distribution, etc. If there is only one cell in the multinomial distribution, we just let the corresponding Z variable be the frequency count of that cell. If the count is zero for a group of cells, then all its summands are set to zero.

Tables 1-3 exhibit $\hat{q}_l$, $l = 1, \ldots, 5$, the Bayes estimates approximated by the Gibbs samples with $I = 10$ and $\nu = 1000$. The estimated posterior standard errors (S.E.) constructed from the replicated samples are also given. The naive posterior 95% coverage intervals for $q_l$ can be obtained using $\hat{q}_l \pm 1.96\,$S.E. The last row summarizes the estimates of the q in terms of the $\hat{P}_l$. These values can be compared with the generalized maximum likelihood estimates computed by Turnbull, which are .538, .295, .210, .095 and 0. Three sets of prior parameters are chosen: (1) $\alpha(l) = .00001$; (2) $\alpha(l) = 1$; and (3) $\alpha(l) = 5$ for $l = 1, \ldots, 5$. The first set mimics a very small prior sample size. The second set selects a uniform prior on the $\{q_l\}$. The last one illustrates strong prior influence, assigning uniform weight 5 to each of the cells. The results confirm our expectation that the Bayes estimates from case (1) are very close to the estimates produced by Turnbull.
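The sampling scheme described in Section 3 can be sketched in a few lines of Python for the Turnbull data above. This is a minimal illustration under stated assumptions, not the authors' program: the function names, the Dirichlet(1, ..., 1) starting values, and the way the standard error is computed (standard deviation of the replicated ratios divided by $\sqrt{\nu}$) are choices made here for illustration, and cell 5 is taken to be the region beyond $t_4$ with no observed deaths.

```python
import random
import statistics

# Turnbull (1974) counts as tabulated in the text; cell 5 lies beyond t4.
DEATHS = [12, 6, 2, 3, 0]
LATE = [2, 4, 2, 5]      # mu_j: death known only to lie in cells 1..j
LOSSES = [3, 2, 0, 3]    # lambda_j: death known only to lie in cells j+1..5


def multinomial(rng, size, weights):
    """Draw multinomial counts by allocating `size` items one at a time."""
    counts = [0] * len(weights)
    for _ in range(size):
        counts[rng.choices(range(len(weights)), weights=weights)[0]] += 1
    return counts


def dirichlet(rng, params):
    """Draw from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def gibbs_estimates(alpha, n_iter=10, n_rep=1000, seed=0):
    """Return (estimates, standard errors) of q_1..q_5 (illustrative sketch)."""
    rng = random.Random(seed)
    cells = len(DEATHS)
    ratios = []
    for _ in range(n_rep):
        q = dirichlet(rng, [1.0] * cells)      # assumed starting value
        for _ in range(n_iter):
            z = [0] * cells
            for j in range(1, cells):          # j = 1, ..., m
                for l, c in enumerate(multinomial(rng, LATE[j - 1], q[:j])):
                    z[l] += c
                for l, c in enumerate(multinomial(rng, LOSSES[j - 1], q[j:])):
                    z[j + l] += c
            # updated Dirichlet parameters: prior + deaths + sampled counts
            gamma = [alpha + DEATHS[l] + z[l] for l in range(cells)]
            q = dirichlet(rng, gamma)
        total = sum(gamma)
        ratios.append([g / total for g in gamma])
    est = [statistics.fmean(r[l] for r in ratios) for l in range(cells)]
    se = [statistics.stdev([r[l] for r in ratios]) / n_rep ** 0.5
          for l in range(cells)]
    return est, se
```

Calling `gibbs_estimates(alpha=1.0)` corresponds to the uniform-prior setting of Table 2; the survival-type row of the tables can then be obtained as one minus the cumulative sums of the returned estimates.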


Table 3: Gibbs Approximation to the Bayes Estimates for $\alpha(l) = 5$, $l = 1, \ldots, 5$

Statistics \ Cell      1      2      3      4      5
$\hat{q}_l$          .354   .226   .137   .148   .135
S.E.                 .001   .001   .001   .001   .000
$\hat{P}_l$          .644   .420   .283   .134   0
Abstract


The Generalized Pareto Distribution (GPD) is a two-parameter family of distributions which can be used to model exceedances over a threshold. Maximum likelihood parameter estimates are preferred since they are asymptotically normal and asymptotically efficient. Numerical methods are required for maximizing the log-likelihood since the minimal sufficient statistics are the order statistics and there is no obvious simplification of the nonlinear likelihood equation. An algorithm is given to compute GPD maximum likelihood estimates by reducing the two-dimensional numerical search for the zeros of the gradient vector to a one-dimensional numerical search.


1.  Generalized  Pareto  Distribution 


A random variable X is defined to have a Generalized Pareto Distribution (GPD), with parameters $k$ and $a$ such that $-\infty < k < \infty$, $a > 0$, if the cumulative distribution function is given by

$$F_{GPD}(x; k, a) = \begin{cases} 1 - (1 - kx/a)^{1/k}, & k < 0,\ x > 0 \\ 1 - \exp(-x/a), & k = 0,\ x > 0 \\ 1 - (1 - kx/a)^{1/k}, & k > 0,\ 0 < x < a/k. \end{cases}$$


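The density function is given by

$$f_{GPD}(x; k, a) = \begin{cases} (1/a)(1 - kx/a)^{1/k - 1}, & k < 0,\ x > 0 \\ (1/a)\exp(-x/a), & k = 0,\ x > 0 \\ (1/a)(1 - kx/a)^{1/k - 1}, & k > 0,\ 0 < x < a/k, \end{cases}$$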


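and the quantile function, $Q(u) = F^{-1}(u)$, is given by

$$Q_{GPD}(u; k, a) = -a\, g(1 - u; k),$$

where $g(\cdot)$ is the power transformation (also called the Box-Cox transformation), defined for $z > 0$ by

$$g(z; \lambda) = \begin{cases} (z^{\lambda} - 1)/\lambda, & \lambda \ne 0 \\ \ln z, & \lambda = 0. \end{cases}$$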
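The relationship between the distribution function and the quantile function can be checked numerically. The following is a small sketch; the function names `box_cox`, `gpd_cdf` and `gpd_quantile` are hypothetical and chosen here only for illustration.

```python
import math


def box_cox(z, lam):
    """Power (Box-Cox) transformation g(z; lambda), defined for z > 0."""
    if z <= 0:
        raise ValueError("z must be positive")
    return math.log(z) if lam == 0 else (z ** lam - 1.0) / lam


def gpd_cdf(x, k, a):
    """GPD distribution function with shape k and scale a > 0."""
    if x <= 0:
        return 0.0
    if k == 0:
        return 1.0 - math.exp(-x / a)
    if k > 0 and x >= a / k:      # beyond the upper end point a/k
        return 1.0
    return 1.0 - (1.0 - k * x / a) ** (1.0 / k)


def gpd_quantile(u, k, a):
    """GPD quantile function Q(u) = -a * g(1 - u; k), for 0 <= u < 1."""
    return -a * box_cox(1.0 - u, k)
```

For $k = 1$ the quantile function reduces to $Q(u) = a u$, the quantile function of the Uniform $(0, a)$ distribution, and inverse-transform sampling is obtained by evaluating `gpd_quantile` at uniform random numbers.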


Pickands (1975) introduced the GPD as a two-parameter family of distributions for exceedances over a threshold. The parameters of the GPD are $a$, the scale parameter, and $k$, the shape parameter. Three special cases of the GPD are:

(i) if $k = 1$ the distribution is Uniform $(0, a)$;

(ii) if $k = 0$ the distribution is Exponential $(1/a)$;

(iii) if $k < 0$ the distribution is Pareto.

Maximum likelihood estimation of the parameters $(k, a)$ has been considered by DuMouchel (1983), Davison (1984), R. L. Smith (1984, 1987), J. A. Smith (1986), and Joe (1987). R. L. Smith (1984) showed that under certain conditions for regularity the maximum likelihood estimates are asymptotically normal and asymptotically efficient. If $(\hat{k}_n, \hat{a}_n)$ denote the maximum likelihood estimates, then for $k < 1/2$, as $n \to \infty$, the vector $\sqrt{n}\,(\hat{k}_n - k,\ \hat{a}_n - a)^T$ is asymptotically bivariate normal with mean zero and covariance matrix

$$\begin{bmatrix} (1-k)^2 & a(1-k) \\ a(1-k) & 2a^2(1-k) \end{bmatrix}.$$

The  maximum  likelihood  estimates  must  be  derived 
numerically  since  the  minimal  sufficient  statistics  for  the 
GPD  are  the  order  statistics  and  there  is  no  obvious 
simplification  of  the  nonlinear  likelihood  equation. 

Hosking and Wallis (1987) proposed a modified Newton-Raphson algorithm to find the maximum of the log-likelihood. They also propose the method of moments and the method of probability-weighted moments as alternative parameter estimators for the GPD when a reduction of the parameter space to $-1/2 < k < 1/2$ is reasonable. These alternative estimators are inefficient, but are easier to compute than the maximum likelihood estimates.

In  this  paper  an  algorithm  for  computing  the 
maximum  likelihood  estimates  is  presented.  The  two- 
dimensional  numerical  search  for  the  zeros  of  the  gra¬ 
dient  of  the  GPD  log-likelihood  is  reduced  to  a  one¬ 
dimensional  numerical  search.  This  simplification  is  due 
to  a  reparameterization  pointed  out  by  Davison  (1984). 


2. Computing Maximum Likelihood Parameter Estimates

Suppose $X = \{X_1, \ldots, X_n\}$ is a random sample from the GPD with largest value $X_{(n:n)}$. The log-likelihood is given by

$$\mathcal{L}_{GPD}(k, a; X) = \begin{cases} -n \ln a + (1/k - 1)\sum_{i=1}^{n} \ln(1 - kX_i/a), & k < 0,\ a > 0 \\ -n \ln a - (1/a)\sum_{i=1}^{n} X_i, & k = 0,\ a > 0 \\ -n \ln a + (1/k - 1)\sum_{i=1}^{n} \ln(1 - kX_i/a), & k > 0,\ a > kX_{(n:n)}. \end{cases}$$

If $k > 1$, there is no maximum likelihood estimate since for any $k > 1$,

$$\lim_{a/k \to X_{(n:n)}^{+}} \mathcal{L}_{GPD}(k, a; X) = \infty.$$

In order to obtain a finite maximum of the GPD log-likelihood, the constraint $k \le 1$ must be imposed. Therefore, computing the GPD maximum likelihood estimators is an optimization on the constrained space

$$A = \{k < 0,\ a > 0\} \cup \{0 < k \le 1,\ a/k > X_{(n:n)}\}.$$


There are two values of $(k, a)$ which must be investigated to compute the GPD maximum likelihood estimator. The first is the local maximum of the log-likelihood on the space A. The second is at the boundary of A where $k = 1$.

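2.1. Local Maximum on A. To compute the local maxima on the space A, consider the gradient vector of the GPD log-likelihood given in the Appendix. The solution to the simultaneous equations

$$\frac{\partial \mathcal{L}_{GPD}(k, a; X)}{\partial k} = 0, \qquad \frac{\partial \mathcal{L}_{GPD}(k, a; X)}{\partial a} = 0$$

may be simplified and written as

$$\left[1 + \frac{1}{n}\sum_{i=1}^{n} \ln\left(1 - \frac{kX_i}{a}\right)\right]\left[\frac{1}{n}\sum_{i=1}^{n}\left(1 - \frac{kX_i}{a}\right)^{-1}\right] = 1,$$

$$k = -\frac{1}{n}\sum_{i=1}^{n} \ln\left(1 - \frac{kX_i}{a}\right).$$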

The bivariate search for the zeros of the gradient vector over A can be reduced to a univariate search since the second equation is a closed form representation for the estimator of $k$ given the ratio $k/a$, and the first equation depends only on $k/a$. Therefore, writing $\theta = k/a$, local maxima of the log-likelihood of the GPD correspond to zeros of the function

$$h(\theta) = \left[1 + \frac{1}{n}\sum_{i=1}^{n} \ln(1 - \theta X_i)\right]\left[\frac{1}{n}\sum_{i=1}^{n} (1 - \theta X_i)^{-1}\right] - 1, \qquad (2.1)$$

with domain

$$B = \{\theta < 1/X_{(n:n)}\}. \qquad (2.2)$$
However, it is important to recognize that not every zero of $h(\theta)$ corresponds to a zero of the gradient vector of the GPD log-likelihood. Therefore, while reducing the bivariate search to a univariate search provides the benefit of simplified computation, it comes with the complication of extraneous zeros.

For example, notice that $h(0) = 0$. Clearly then $\theta = 0$ is a zero of $h(\theta)$. However, $\theta = 0$ corresponds to $k = 0$ in the log-likelihood equations and it can be shown that the gradient vector at $k = 0$ has elements

$$\lim_{k \to 0} \frac{\partial \mathcal{L}_{GPD}(k, a; X)}{\partial k} = \frac{1}{a}\sum_{i=1}^{n} X_i - \frac{1}{2a^2}\sum_{i=1}^{n} X_i^2,$$

$$\lim_{k \to 0} \frac{\partial \mathcal{L}_{GPD}(k, a; X)}{\partial a} = -\frac{n}{a} + \frac{1}{a^2}\sum_{i=1}^{n} X_i,$$

which are both equal to zero if and only if $a = \bar{X}$ and $(1/n)\sum_{i=1}^{n} X_i^2 = 2\bar{X}^2$. Since this equality does not hold in general, the zero $\theta = 0$ does not correspond to a local maximum of $\mathcal{L}_{GPD}(k, a; X)$.

The presence of these extraneous zeros causes two significant complications. First, an algorithm must search the space B for more than one zero. In fact, since it is known that $\theta = 0$ does not correspond to a local maximum of $\mathcal{L}_{GPD}(k, a; X)$, the algorithm must be designed to avoid numerical convergence to $\theta = 0$. Second, every zero of $h(\theta)$ must be evaluated to determine if it corresponds to a local maximum, local minimum, or saddle point of $\mathcal{L}_{GPD}(k, a; X)$, or an extraneous zero of $h(\theta)$.

The following theorem states several properties of $h(\theta)$ which are useful in formulating an algorithm for determining zeros of $h(\theta)$.


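Theorem: Consider the function $h(\theta)$ given in (2.1) defined on the space B given in (2.2). Then:

(i) $\lim_{\theta \to 1/X_{(n:n)}^{-}} h(\theta) = -\infty$;

(ii) $h(\theta) < 0$ for all $\theta < \theta_L = 2[X_{(1:n)} - \bar{X}] / [X_{(1:n)}]^2$;

(iii) $h'(\theta) = \dfrac{1}{\theta}\left\{\dfrac{1}{n}\sum_{i=1}^{n}(1 - \theta X_i)^{-2} - \left[\dfrac{1}{n}\sum_{i=1}^{n}(1 - \theta X_i)^{-1}\right]^2 - \left[\dfrac{1}{n}\sum_{i=1}^{n}\ln(1 - \theta X_i)\right]\left[\dfrac{1}{n}\sum_{i=1}^{n}(1 - \theta X_i)^{-1} - \dfrac{1}{n}\sum_{i=1}^{n}(1 - \theta X_i)^{-2}\right]\right\}$;

(iv) $h'(0) = 0$;

(v) $h''(0) = (1/n)\sum_{i=1}^{n} X_i^2 - 2\bar{X}^2$.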

The first result indicates that an upper bound for any zero of $h(\theta)$ is given by $\theta_U = 1/X_{(n:n)}$. Since this is a limiting result, an algorithm can use $\theta_U - \epsilon$ for some $\epsilon > 0$ as the upper bound. The second result, provided by an anonymous referee, provides a lower bound, $\theta_L$, for any zero of $h(\theta)$. Coupling these two results with the fact that $\theta = 0$ is an extraneous zero, an algorithm must divide the space B into $(\theta_L, 0)$ and $(0, \theta_U)$ and numerically search for zeros of $h(\theta)$ on these two bounded intervals. Because bounds are known, modifications of the Newton-Raphson zero search algorithms can be made which limit step size so that iterative solutions remain within the known boundaries.

The third result is the derivative of $h(\theta)$, required for the Newton-Raphson algorithm to search for zeros of $h(\theta)$. The fourth result indicates that the extraneous zero of $h(\theta)$ given by $\theta = 0$ is either a local maximum or local minimum of $h(\theta)$.

The fifth result can be used to determine whether $h(0)$ is a local maximum or local minimum. If $h''(0) > 0$ then there are $j_n$ roots on $(\theta_L, 0)$ where $j_n$ is an odd integer and there are $j_p$ roots on $(0, \theta_U)$ where $j_p$ is an odd integer. This follows since $h''(0) > 0$ implies that for some $\epsilon > 0$, $h(0 - \epsilon) > 0$, and since $h(\theta_L) < 0$, the number of zeros on $(\theta_L, 0)$ given by $j_n$ must be an odd integer. The argument for $j_p$ odd is similar. In the data sets used in investigating the GPD maximum likelihood estimators, it appears that $j_n = j_p = 1$.

If $h''(0) < 0$ then there are $j_n$ roots on $(\theta_L, 0)$ where $j_n$ is zero or an even integer and there are $j_p$ roots on $(0, \theta_U)$ where $j_p$ is zero or an even integer. This follows since $h''(0) < 0$ implies that for some $\epsilon > 0$, $h(0 - \epsilon) < 0$, and since $h(\theta_L) < 0$, either there exists no zero of $h(\theta)$ on $(\theta_L, 0)$ or the number of zeros on $(\theta_L, 0)$ given by $j_n$ must be an even integer. The argument for $j_p$ either zero or an even integer is similar. In the data sets used in investigating the GPD maximum likelihood estimators, it appears that in many cases $j_n = j_p = 0$. This result agrees with the finding in Hosking and Wallis (1987) indicating that in many cases with $k > 0$ and $n < 25$ the GPD maximum likelihood estimators do not exist. The remaining data sets in the investigation indicated that either $j_n = 0$, $j_p = 2$ or $j_n = 2$, $j_p = 0$ or $j_n = 2$, $j_p = 2$.

The possible existence of multiple zeros of $h(\theta)$ on B complicates the numerical search, but an algorithm can be designed to find these multiple zeros.

Each zero of $h(\theta)$ indicates a candidate for the local maximum of the log-likelihood. For each of the $j_n + j_p$ zero(s), denoted by $\theta^{(i)}$, compute

$$k_i = -\frac{1}{n}\sum_{j=1}^{n} \ln(1 - \theta^{(i)} X_j), \qquad a_i = \frac{k_i}{\theta^{(i)}}.$$

This value must be evaluated using the Hessian matrix of the GPD log-likelihood given in the Appendix to determine if it is a local maximum, local minimum or saddle point of the GPD log-likelihood. The point $(k_i, a_i)$ is a local maximum, and therefore considered a candidate for the GPD maximum likelihood estimator, if the Hessian matrix evaluated at the estimators is negative definite.

The pair $(k_i, a_i)$ which has the largest value of $\mathcal{L}_{GPD}(k_i, a_i; X)$ is identified as the local maximum on the space A and will be denoted by $(k_m, a_m)$.

2.2 Boundary Maximum on A. Any local maximum of the GPD log-likelihood on the domain A must exceed the log-likelihood evaluated at the boundary in order to be the maximum likelihood estimator. Hence, the second value which must be investigated is at the boundary of A where $k = 1$. Given $k = 1$ and $a > X_{(n:n)}$, $\mathcal{L}_{GPD}(1, a; X) = -n \ln a$. Therefore the boundary maximum, denoted by $(k_b, a_b)$, is given by $k_b = 1$ and $a_b = X_{(n:n)}$. The problem is complicated by the optimization being taken over an open set, but it is treated as a maximum taken over a closed set.

The GPD maximum likelihood estimator, denoted by $(\hat{k}, \hat{a})$, is then given by the local maximum $(k_m, a_m)$ if $\mathcal{L}_{GPD}(k_m, a_m; X) > -n \ln X_{(n:n)}$ and is given by the boundary maximum $(k_b, a_b)$ if $\mathcal{L}_{GPD}(k_m, a_m; X) < -n \ln X_{(n:n)}$.

If no local maximum is found, then there is no GPD maximum likelihood estimate and the alternative estimators given by Hosking and Wallis (1987) are recommended.
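The procedure of Section 2 can be sketched as follows. This is a simplified illustration rather than the algorithm of the paper: it uses plain bisection instead of a step-limited Newton-Raphson search, it looks for a single sign change of $h(\theta)$ on each of the two intervals (so it covers only the situation $h''(0) > 0$ with one root per interval), and it ranks candidates by log-likelihood without the Hessian check. The function names and the small offsets used to stay away from $\theta = 0$ and from the interval end points are assumptions made here.

```python
import math


def h(theta, xs):
    """h(theta) of equation (2.1); requires theta < 1 / max(xs)."""
    n = len(xs)
    mean_log = sum(math.log1p(-theta * x) for x in xs) / n
    mean_inv = sum(1.0 / (1.0 - theta * x) for x in xs) / n
    return (1.0 + mean_log) * mean_inv - 1.0


def bisect_root(f, lo, hi, iters=200):
    """Bisection on [lo, hi]; returns None when f does not change sign."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo * f_hi > 0:
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gpd_mle_sketch(xs):
    """Return (log-likelihood, k, a) of the best candidate found."""
    n = len(xs)
    xbar, x_min, x_max = sum(xs) / n, min(xs), max(xs)
    theta_l = 2.0 * (x_min - xbar) / x_min ** 2      # lower bound
    theta_u = 1.0 / x_max                            # upper bound
    delta = 1e-4 / xbar                              # keeps away from theta = 0
    best = (-n * math.log(x_max), 1.0, x_max)        # boundary candidate, k = 1
    intervals = ((theta_l * (1 + 1e-6), -delta),
                 (delta, theta_u * (1 - 1e-9)))
    for lo, hi in intervals:
        theta = bisect_root(lambda t: h(t, xs), lo, hi)
        if theta is None:
            continue
        sum_log = sum(math.log1p(-theta * x) for x in xs)
        k = -sum_log / n
        if k > 1.0:                                  # outside the space A
            continue
        a = k / theta
        loglik = -n * math.log(a) + (1.0 / k - 1.0) * sum_log
        if loglik > best[0]:
            best = (loglik, k, a)
    return best
```

When neither interval yields a sign change, the sketch falls back to the boundary value $(1, X_{(n:n)})$; a fuller implementation would instead search each interval for multiple zeros and classify every candidate with the Hessian matrix, as the text describes.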


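On the space A, the Hessian matrix of the GPD log-likelihood has elements

$$\frac{\partial^2 \mathcal{L}_{GPD}(k, a; X)}{\partial k^2} = \frac{1}{k^3}\left[2\sum_{i=1}^{n}\ln\left(1 - \frac{kX_i}{a}\right) + (4 - 2k)\sum_{i=1}^{n}\left(1 - \frac{kX_i}{a}\right)^{-1} - (1 - k)\sum_{i=1}^{n}\left(1 - \frac{kX_i}{a}\right)^{-2} - (3 - k)\,n\right],$$

$$\frac{\partial^2 \mathcal{L}_{GPD}(k, a; X)}{\partial a^2} = \frac{1}{k a^2}\left[n - (1 - k)\sum_{i=1}^{n}\left(1 - \frac{kX_i}{a}\right)^{-2}\right],$$

$$\frac{\partial^2 \mathcal{L}_{GPD}(k, a; X)}{\partial k\,\partial a} = \frac{1}{k^2 a}\left[n - (2 - k)\sum_{i=1}^{n}\left(1 - \frac{kX_i}{a}\right)^{-1} + (1 - k)\sum_{i=1}^{n}\left(1 - \frac{kX_i}{a}\right)^{-2}\right].$$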
ABSTRACT.

This  paper  discusses  asymptotic  efficiency  of  the 
maximum  likelihood  estimator  of  the  parameters  of  the 
M/G/1  queueing  system  for  full  likelihood  and  reduced 
likelihood  functions.  The  efficiency  of  the  maximum 
likelihood  estimator  of  the  reduced  likelihood  function 
relative  to  full  likelihood  function  is  derived. 

1.  INTRODUCTION. 

Clarke (1957) discussed the estimation problem of traffic intensity for the M/M/1 queueing system using maximum likelihood principles. The problem of statistical inference for birth and death processes was considered as Markov processes by Wolff (1965). A large sample theory based on maximum likelihood theory for Markov processes developed by Billingsley (1961) was applied to make inference for arrival and service rates. Jenkins (1972) obtained the maximum likelihood estimate of mean waiting time in the simple M/M/1 queue under conditions of incomplete information. In 1981, Basawa and Prabhu considered the single server queueing model and obtained estimates for interarrival and service time distribution functions without assuming the steady state. In this paper, asymptotic efficiencies of the estimators for the M/G/1 queueing model are derived for full likelihood and reduced likelihood functions based on Lehmann's (1983) work. Asymptotic relative efficiency of the estimator is obtained as the square of the correlation coefficient between estimators.

2.  ESTIMATION  PROCEDURES. 

Let interarrival and service times be independent, identically distributed random variables. Their densities are defined by $f(t, \theta)$ and $g(x, \phi)$, respectively, where $\theta$ and $\phi$ are the parameters to be estimated. Denote

Interarrival times $\{t_k, k \ge 1\}$

and

Service times $\{x_k, k \ge 1\}$.

Observation starts at time $t = 0$ and the queue is observed until the first $n$ customers have departed. Let the service times of these customers be $x_1, x_2, \ldots, x_n$. Let the $n$th departure occur at time $D_n$. Observe the interarrival times of all customers during the interval $(0, D_n)$. Let their interarrival times be $t_1, t_2, \ldots, t_{N_A}$, where

$$N_A = N_A(D_n) = \max(k : t_1 + t_2 + \cdots + t_k \le D_n).$$


Na  >  n. 

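The likelihood function for estimating the parameters $\theta$ and $\phi$ is given by

$$L(\theta, \phi) = \left\{\prod_{i=1}^{N_A} f(t_i; \theta)\right\}\left\{\prod_{i=1}^{n} g(x_i; \phi)\right\}(1 - F(Z_n; \theta)), \qquad (1)$$

where $(1 - F(Z_n; \theta))$ corresponds to the incomplete arrival interval when sampling is terminated at the epoch $D_n$, and $Z_n = D_n - \sum_{i=1}^{N_A} t_i$.

Basawa and Prabhu (1981) considered the following reduced likelihood function:

$$L^a(\theta, \phi) = \left\{\prod_{i=1}^{N_A} f(t_i; \theta)\right\}\left\{\prod_{i=1}^{n} g(x_i; \phi)\right\}. \qquad (2)$$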


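Equation (2) is an approximation of equation (1). Taking the logarithms in equation (2), differentiating with respect to $\theta$ and $\phi$, and equating to zero, we get

$$\frac{\partial \ln L^a}{\partial \theta} = \sum_{i=1}^{N_A} \frac{\partial}{\partial \theta} \ln f(t_i; \theta) = 0 \qquad (3)$$

and

$$\frac{\partial \ln L^a}{\partial \phi} = \sum_{i=1}^{n} \frac{\partial}{\partial \phi} \ln g(x_i; \phi) = 0. \qquad (4)$$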

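Let $\hat{\theta}_n^a$ and $\hat{\phi}_n^a$ be the estimates of $\theta$ and $\phi$ based on the reduced likelihood function. Asymptotic properties of these estimators are given by Basawa and Prabhu (1981). Taking logarithms in the full likelihood equation (1), we have

$$L^f = \ln L(\theta, \phi) = \sum_{i=1}^{N_A} \ln f(t_i; \theta) + \sum_{i=1}^{n} \ln g(x_i; \phi) + \ln(1 - F(Z_n; \theta)). \qquad (5)$$

Differentiating equation (5) partially with respect to $\theta$ and $\phi$ and equating to zero, we have

$$\frac{\partial L^f}{\partial \theta} = \sum_{i=1}^{N_A} \frac{\partial}{\partial \theta} \ln f(t_i; \theta) + H(Z_n; \theta) = 0 \qquad (6)$$

and

$$\frac{\partial L^f}{\partial \phi} = \sum_{i=1}^{n} \frac{\partial}{\partial \phi} \ln g(x_i; \phi) = 0, \qquad (7)$$

where

$$H(Z_n; \theta) = \frac{\partial}{\partial \theta} \ln(1 - F(Z_n; \theta)).$$

Let $\hat{\theta}_n$ and $\hat{\phi}_n$ be the estimates of $\theta$ and $\phi$ based on the full likelihood function, which can be obtained by solving equations (6) and (7). Comparing equations (6) and (7) with equations (3) and (4), it is clear that $\hat{\phi}_n = \hat{\phi}_n^a$ and that $\hat{\theta}_n$ differs from $\hat{\theta}_n^a$.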

It can be shown in the following particular case that $\hat{\theta}_n$ and $\hat{\theta}_n^a$ are asymptotically equivalent.

M/G/1 queue

Let the probability density function of the interarrival times with mean $\theta$ be given by

$$f(t, \theta) = \theta^{-1} \exp(-t/\theta).$$

Then the distribution function is

$$F(t, \theta) = 1 - \exp(-t/\theta).$$

Let the probability density function of the service time be $g(x, \phi)$. Considering equation (3), we have

$$\sum_{i=1}^{N_A} \frac{\partial}{\partial \theta} \ln(\theta^{-1} \exp(-t_i/\theta)) = 0. \qquad (8)$$

Simplification of the left side of equation (8) yields

$$-\frac{N_A}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{N_A} t_i = 0. \qquad (9)$$

Now considering equation (6),

$$\sum_{i=1}^{N_A} \frac{\partial}{\partial \theta} \ln(\theta^{-1} \exp(-t_i/\theta)) + \frac{\partial}{\partial \theta} \ln(\exp(-Z_n/\theta)) = 0. \qquad (10)$$

Simplifying equation (10), we get

$$-\frac{N_A}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{N_A} t_i + \frac{Z_n}{\theta^2} = 0. \qquad (11)$$

Solving equation (9), we get

$$\hat{\theta}_n^a = \frac{\sum_{i=1}^{N_A} t_i}{N_A}. \qquad (12)$$

Solving equation (11), we have

$$\hat{\theta}_n = \frac{\sum_{i=1}^{N_A} t_i}{N_A} + \frac{Z_n}{N_A} = \frac{D_n}{N_A}. \qquad (13)$$

3. ASYMPTOTIC EFFICIENCY OF THE ESTIMATORS.

In this section, asymptotic efficiencies of the estimators are derived for some particular queueing models. Some of the preliminary results related to asymptotic properties of the estimates based on Lehmann's (1983) work are reviewed.

Preliminary Results

The following theorem establishes that any consistent root of the likelihood equation is asymptotically normal and efficient.
Theorem 3.1. Suppose that $X_1, X_2, \ldots, X_n$ are independent, identically distributed and satisfy appropriate regularity assumptions [Lehmann (1983), pages 406 and 415]. Then any consistent sequence $\hat{\theta}_n = \hat{\theta}_n(X_1, X_2, \ldots, X_n)$ of roots of the likelihood equation satisfies

$$\sqrt{n}\,(\hat{\theta}_n - \theta) \to N(0, (I(\theta))^{-1}), \qquad (14)$$

where

$$I(\theta) = E\left[\frac{\partial}{\partial \theta} \ln f(X, \theta)\right]^2.$$

Remark 3.1. Any sequence of estimators satisfying equation (14) will be said to be asymptotically efficient.

Suppose that $\tilde{\theta}_n$ is any consistent estimator of $\theta$, and the assumptions of Theorem 3.1 hold. Then the root $\hat{\theta}_n$ of the likelihood equation closest to $\tilde{\theta}_n$ is also consistent [Lehmann (1983), Theorem 2.2, page 430].
The usual iterative methods for solving the likelihood equation

$$L'(\theta) = 0$$

are based on replacing the left side by the linear terms of a Taylor's series expansion about an approximate solution $\tilde{\theta}$. If $\hat{\theta}$ denotes a root of the likelihood equation ($L'(\hat{\theta}) = 0$), then this leads to the approximation

$$0 = L'(\hat{\theta}) \approx L'(\tilde{\theta}) + (\hat{\theta} - \tilde{\theta}) L''(\tilde{\theta})$$

and hence

$$\hat{\theta} = \tilde{\theta} - \frac{L'(\tilde{\theta})}{L''(\tilde{\theta})}. \qquad (15)$$

The procedure is then to iterate by replacing $\tilde{\theta}$ by the value of $\hat{\theta}$ on the right side of equation (15), and so on.
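The iteration in equation (15) can be illustrated for the exponential interarrival model, where $L'(\theta) = -N/\theta + \sum t_i/\theta^2$ and the root of the likelihood equation is the sample mean. The sketch below uses hypothetical function names and also includes the scoring variant that replaces $-L''$ by $N I(\theta)$ with $I(\theta) = 1/\theta^2$; it is meant only as a worked example of the update rule.

```python
def score(theta, n, total):
    """L'(theta) for n exponential observations with sum `total`."""
    return -n / theta + total / theta ** 2


def score_derivative(theta, n, total):
    """L''(theta) for the same model."""
    return n / theta ** 2 - 2.0 * total / theta ** 3


def newton_step(theta, n, total):
    """One application of the update theta - L'(theta) / L''(theta)."""
    return theta - score(theta, n, total) / score_derivative(theta, n, total)


def scoring_step(theta, n, total):
    """One step using the information n * I(theta) = n / theta**2."""
    return theta + score(theta, n, total) * theta ** 2 / n


def iterate(theta, n, total, steps=8):
    """Repeat the Newton step a fixed number of times."""
    for _ in range(steps):
        theta = newton_step(theta, n, total)
    return theta
```

For this model a single scoring step started from any positive value returns the sample mean exactly, whereas the Newton step converges quadratically when started reasonably close to it.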

The following theorems give conditions on $\tilde{\theta}$ under which the resulting sequence of estimators is consistent, asymptotically normal and efficient.

Theorem 3.2. Suppose that the assumptions of Theorem 3.1 hold and that $\tilde{\theta}_n$ is not only a consistent but a $\sqrt{n}$-consistent estimator of $\theta$, that is, that $\sqrt{n}\,(\tilde{\theta}_n - \theta)$ is bounded in probability so that $\tilde{\theta}_n$ tends to $\theta$ at least at the rate of $1/\sqrt{n}$. Then the estimator sequence

$$\delta_n = \tilde{\theta}_n - \frac{L'(\tilde{\theta}_n)}{L''(\tilde{\theta}_n)} \qquad (16)$$

is asymptotically efficient, that is, it satisfies equation (14) with $\delta_n$ in place of $\hat{\theta}_n$.

Theorem 3.3. Suppose that the assumptions of Theorem 3.2 hold and that $I(\theta)$ is a continuous function of $\theta$. Then the estimator

$$\delta_n' = \tilde{\theta}_n + \frac{L'(\tilde{\theta}_n)}{n I(\tilde{\theta}_n)} \qquad (17)$$

is also asymptotically efficient.

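M/M/1 queue

The estimates for the M/M/1 queue with mean interarrival time $\theta$ and mean service time $\phi$ using the full likelihood and reduced likelihood equations are

$$\hat{\theta}_n = \frac{D_n}{N_A}, \qquad \hat{\theta}_n^a = (N_A)^{-1}\sum_{i=1}^{N_A} t_i$$

and

$$\hat{\phi}_n = \hat{\phi}_n^a = (n)^{-1}\sum_{i=1}^{n} x_i.$$

Theorem 3.3 can be applied (for a fixed number of observations) to show that $\hat{\theta}_n$ is asymptotically efficient [Basawa and Prabhu (1981), page 479]. Rewriting equation (10), we have

$$\frac{\partial L^f}{\partial \theta} = \sum_{i=1}^{N_A} \frac{\partial}{\partial \theta} \ln(\theta^{-1} \exp(-t_i/\theta)) + \frac{\partial}{\partial \theta} \ln(\exp(-Z_n/\theta)) = -\frac{N_A}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{N_A} t_i + \frac{Z_n}{\theta^2}.$$

Therefore, at $\theta = \hat{\theta}_n^a$ the first two terms cancel and the derivative equals $Z_n / (\hat{\theta}_n^a)^2$, and

$$I(\theta) = E\left(\frac{\partial}{\partial \theta} \ln f(t, \theta)\right)^2 = E\left(\frac{\partial}{\partial \theta} \ln(\theta^{-1} \exp(-t/\theta))\right)^2 = \frac{1}{\theta^2}.$$

Using equation (17) with $\hat{\theta}_n^a$ as the initial estimator and $N_A$ observed interarrival times, we have

$$\hat{\theta}_n^a + \frac{Z_n / (\hat{\theta}_n^a)^2}{N_A / (\hat{\theta}_n^a)^2}. \qquad (21)$$

Hence, equation (21) after simplification yields

$$\hat{\theta}_n^a + \frac{Z_n}{N_A} = \frac{D_n}{N_A} = \hat{\theta}_n.$$

It can be easily seen from Theorem 3.1 that

$$\sqrt{n}\,(\hat{\theta}_n - \theta) \to N(0, (I(\theta))^{-1}) = N(0, \theta^2),$$

which implies that $\hat{\theta}_n$ is asymptotically efficient.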

4.  ASYMPTOTIC  EFFICIENCY  AND  ITS 
RELATIONSHIP  WITH  CORRELATION  CO¬ 
EFFICIENT. 

Let $\hat{\theta}_n$ be the likelihood equation estimator such that

$$U_n = \sqrt{n}\,(\hat{\theta}_n - \theta) \to N(0, \sigma_1^2), \qquad (26)$$

where


Let $\tilde{\theta}_n$ be a $\sqrt{n}$-consistent estimator such that

$$V_n = \sqrt{n}\,(\tilde{\theta}_n - \theta) \to N(0, \sigma_2^2),$$

where 


^2 


_  22  _  ( 


(28) 


(29) 


The estimate of asymptotic relative efficiency (ARE) of $V_n$ with respect to $U_n$ is


e- 


(30) 


If $Z_n \to 0$, then $e = 1$.

If $Z_n \to D_n$, then $e = 0$.

The following theorem establishes the relationship of ARE with the correlation of $U_n$ and $V_n$.

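Theorem 4.1. Let $U_n$ and $V_n$ tend to the bivariate normal distribution with mean zero and covariance matrix

$$\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix},$$

where $\sigma_{11} = \sigma_1^2$ and $\sigma_{22} = \sigma_2^2$, and $\sigma_2^2 \ge \sigma_1^2$. Then the ARE of $V_n$ with respect to $U_n$ is given by $\rho^2$, where

$$\rho = \frac{\sigma_{12}}{(\sigma_{11}\sigma_{22})^{1/2}}$$

is the correlation coefficient between $U_n$ and $V_n$.

Proof. The proof requires only to show that

$$\sigma_{12} = \sigma_{11} = \sigma_1^2.$$

Consider

$$Z = \mathrm{Var}[(1 - a)U_n + aV_n] = a^2[\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}] + a[-2\sigma_1^2 + 2\sigma_{12}] + \sigma_1^2. \qquad (31)$$

The quantity $Z$ is nonnegative and approaches its minimum value when $a = 0$, since $U_n$ is asymptotically efficient. Thus, for minimization, differentiating $Z$ with respect to $a$ and equating to zero, we have

$$\frac{dZ}{da} = 2a(\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}) + (-2\sigma_1^2 + 2\sigma_{12}) = 0. \qquad (32)$$

Solving equation (32), we get

$$a = \frac{\sigma_1^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}. \qquad (33)$$

Since $a = 0$, we have

$$\sigma_1^2 = \sigma_{12}. \qquad (34)$$

Using $\sigma_{11} = \sigma_1^2$ and $\sigma_{22} = \sigma_2^2$, we have

$$\rho^2 = \frac{\sigma_{12}^2}{\sigma_{11}\sigma_{22}} = \frac{\sigma_1^4}{\sigma_1^2\sigma_2^2} = \frac{\sigma_1^2}{\sigma_2^2}.$$

But $\sigma_1^2/\sigma_2^2$ is the ARE of $V_n$ with respect to $U_n$.

Remark 4.1. Theorem 4.1 clearly indicates that under sufficiently regular conditions, good efficiency of $\{\tilde{\theta}_n\}$ is equivalent to high correlation with $\{\hat{\theta}_n\}$.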
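The algebra in the proof can be checked numerically. The short sketch below, with hypothetical function names, evaluates the variance of the combination $(1 - a)U_n + aV_n$, the minimizing weight, and the squared correlation for given values of $\sigma_1^2$, $\sigma_2^2$ and $\sigma_{12}$.

```python
def combo_variance(a, s11, s22, s12):
    """Variance of (1 - a) * U + a * V for the given (co)variances."""
    return (1 - a) ** 2 * s11 + a ** 2 * s22 + 2 * a * (1 - a) * s12


def minimizing_weight(s11, s22, s12):
    """Weight a that minimizes combo_variance (assumes s11 + s22 != 2 * s12)."""
    return (s11 - s12) / (s11 + s22 - 2 * s12)


def squared_correlation(s11, s22, s12):
    """Squared correlation coefficient between U and V."""
    return s12 ** 2 / (s11 * s22)
```

With $\sigma_{12} = \sigma_1^2$ the minimizing weight is zero, and the squared correlation equals the ratio $\sigma_1^2/\sigma_2^2$; for example, $\sigma_1^2 = 1$ and $\sigma_2^2 = 4$ give a relative efficiency of one quarter.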